Towards a Synthesis of Local and Global Pattern Discovery
Topic: This project will develop a unifying view on methods for local and global pattern discovery in databases.
Code Name: GLocSyn
Duration: 2 years (09/2006--08/2008)
Extension: 2 years (09/2008--08/2010)
2. Extension: 2 years (09/2010--06/2013)
Funding: DFG
Goals of this project
Rule learning has been a core research area in machine learning and data mining since the early days of the fields. Research in this area progresses along two different lines. One is automated induction of global patterns, usually rule sets, which provide explanations for all possible examples in a target domain. The other is the discovery of local patterns, usually single rules, that capture interesting anomalies in parts of the data. While there is an obvious relation between both research areas - global rule sets are composed of individual rules that may be interpreted as local patterns - so far both fields have developed fairly independently. In particular, there is no systematic exploration of their relation in the literature.
The main goal of the work in this project is to increase our understanding of the interplay between local and global rule induction. Local patterns can be combined into global patterns with various variants of the so-called covering strategy. For this purpose, local patterns must be evaluated with a suitable trade-off between their interestingness (typically related to the deviation from the distribution of a random rule) and their coverage (typically measured in the number of examples covered by a rule). However, it is still unclear
- how a suitable trade-off between coverage and precision can be determined
- what criteria optimize this trade-off
- whether these criteria can be optimized using heuristic search
- what criteria can be defined for non-classification tasks
- what are suitable techniques for combining local patterns into a global pattern
Within this project, we on the one hand intend to give solid theoretical and empirical answers to these questions and on the other hand develop a suite of practical rule learning algorithms whose generality goes far beyond the currently available systems. With this general goal in mind, we intend to achieve the following subgoals within this project:
- evaluate existing metrics for evaluating local patterns in the context of global pattern discovery
- clearly formulate the local pattern discovery task as a search problem and develop suitable search heuristics for this task
- study the quality of solutions found by efficient heuristic search in comparison to the exhaustive search algorithms that currently prevail in local pattern discovery
- understand the effects of different strategies for combining local patterns into global patterns
- re-formulate rule-based regression and clustering as local pattern discovery problems and develop suitable algorithms for finding such patterns
- identify a set of conceptual building blocks that allow to describe existing and configure novel global and local rule learning algorithms at an abstract level
- design and implement a powerful, freely available software package that integrates local and global rule induction into a single framework
- empirically evaluate the developed techniques on real-world data sets from the social sciences
- disseminate the project results widely, not only via academic writings, but also try to reach potential industrial partners via workshops and tutorials at national and international conferences
- outline a textbook on rule-based machine learning and data mining, based on material that will be developed during the course of this project
The long-term goal of our research is to profile our group as a competence center for inductive rule learning. Eventually, we intend to develop rule-based solutions for the entire spectrum of learning tasks that are typically encountered in practical applications, and to employ these techniques in close co-operation with industrial partners. We are convinced that a wide dissemination of the framework, methods, and software developed in this project will become a suitable vehicle for establishing this goal.
We would nevertheless like to stress that we expect that our results will not only be relevant for rule learning, but will apply to arbitrary types of local patterns, because the strategies for evaluating local patterns and combining them into a global theory that we will study in this project do not presume that the patterns in question are formulated in the form of rules.