Reading Group Feature Selection - SS 07
- target group: (PhD-)students and staff with some ML background
- time and place: Wednesday, 11:30-12:30 at E302 or a nearby seminar room
- contact: weizsaec@rbg.informatik.tu-darmstadt.de
Introduction
Feature selection is the task of choosing a subset S of input variables/features that yields high performance when used by some learning algorithm. This explanation is somewhat vague in at least two respects:
- Representation of the superset: Let V be the set that contains all variables considered. Is V given explicitly, so that we can search it by looking at each member in turn, or is it only an abstract set from which we can generate members but where exhaustive search is infeasible?
- Division of feature selection and learning: Don't the selection of useful variables and the learning of weights for a model using these variables belong together?
We want to study feature selection techniques, of which a survey is given in
- [GuyonE05]: An Introduction to Variable and Feature Selection, Isabelle Guyon, André Elisseeff, JMLR Special Issue on Variable and Feature Selection (Kernel Machines Section), 2003.
Further, we want to contrast these techniques with those used in symbolic rule learning algorithms, as rules can also be taken as binary variables.
- [BlumL97]: Selection of Relevant Features and Examples in Machine Learning, Avrim L. Blum, Pat Langley, 1997.
Schedule
Date | Reading | Presenter |
---|---|---|
16.5.2007 | Intro | |
23.5.2007 | Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy | Sang-Hyeun |
30.5.2007 | Searching for Interacting Features | Nik |
6.6.2007 | Distributional Word Clusters vs. Words for Text Categorization | Eneldo |
13.6.2007 | Ranking a Random Feature for Variable and Feature Selection | Frederik |
20.6.2007 | Semi-supervised Feature Selection via Spectral Analysis | Lorenz |
Systematics following [GuyonE05]
For a first overview we distinguish the following families of feature selection approaches, following the survey paper [GuyonE05] (a minimal sketch contrasting the three families is given after the list):
- filters - selection schemes that do not depend on nor use any predictor
- wrappers - approximate optimization of feature subsets w.r.t. the cost function of the predictor
- embedded - integration of feature selection criteria into the cost function of the learner
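To make the distinction concrete, here is a minimal sketch assuming a scikit-learn environment; the synthetic dataset, the mutual-information criterion for the filter, the greedy forward search for the wrapper, and the L1-regularized model for the embedded case are illustrative choices on my part, not prescriptions from [GuyonE05].

```python
# Minimal sketch of the three approach families (illustrative parameters only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif   # filter criterion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Filter: rank features by a criterion that ignores the predictor.
mi = mutual_info_classif(X, y, random_state=0)
filter_top5 = np.argsort(mi)[::-1][:5]

# Wrapper: greedy forward selection scored by the predictor's CV accuracy.
selected = []
for _ in range(5):
    best_f, best_score = None, -np.inf
    for f in range(X.shape[1]):
        if f in selected:
            continue
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, selected + [f]], y, cv=3).mean()
        if score > best_score:
            best_f, best_score = f, score
    selected.append(best_f)

# Embedded: the learner's own (L1-regularized) cost function drives selection.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
embedded = np.flatnonzero(l1_model.coef_[0])

print(sorted(filter_top5), sorted(selected), sorted(embedded))
```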
Reading Proposals
- Further proposals are welcome.
Filters
- [DhillonMK03]: A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification, Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar, 2003. About: Feature creation by distributional clustering based on information-theoretic relations, applied to (hierarchical) text classification. See [GuyonE05], section 5.1; see also Bekkerman et al. 2003.
- [StoppigliaDDO03]: Ranking a Random Feature for Variable and Feature Selection, Hervé Stoppiglia, Gérard Dreyfus, Rémi Dubois, Yacine Oussar, 2003. About: Combines feature ranking via Gram-Schmidt orthogonalization in example space with a random probe feature in order to get the right cut-off (a simplified sketch of the probe idea follows this list). See [GuyonE05], section 4.2.
- [ZhaoL07]: Searching for Interacting Features, Z. Zhao and H. Liu, 2007. About: Finds subsets of features that are jointly relevant.
- [ZhaoL07b]: Semi-supervised Feature Selection via Spectral Analysis, Z. Zhao and H. Liu, 2007. About: Graph-theoretic ranking of features. If the image of a point in example space w.r.t. the root of the Laplacian of the similarity matrix has a large L_2 norm, the corresponding feature is considered to fit the given similarities less well and is therefore less relevant.
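As a rough illustration of the random-probe idea from [StoppigliaDDO03], the following sketch ranks features by squared correlation with the target instead of the paper's full Gram-Schmidt orthogonalization, and keeps only those features that score better than most purely random probe features. The function name, the probe quantile, and the synthetic data are my own illustrative choices.

```python
# Simplified random-probe filter: keep features that beat random probes.
import numpy as np

def probe_filter(X, y, n_probes=100, quantile=0.95, rng=None):
    """Keep features whose ranking score beats most random probe features."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Squared correlation with the target as the ranking criterion.
    score = (Xc.T @ yc) ** 2 / (np.sum(Xc ** 2, axis=0) * np.sum(yc ** 2) + 1e-12)
    # Score purely random probe features in exactly the same way.
    P = rng.standard_normal((X.shape[0], n_probes))
    Pc = P - P.mean(axis=0)
    probe = (Pc.T @ yc) ** 2 / (np.sum(Pc ** 2, axis=0) * np.sum(yc ** 2) + 1e-12)
    # Keep features that look more relevant than e.g. 95% of the probes.
    return np.flatnonzero(score > np.quantile(probe, quantile))

# Example: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(300)
print(probe_filter(X, y, rng=1))   # typically -> [0 1]
```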
Filters/Wrappers
- [PengLD05]: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, Peng, H.C., Long, F., and Ding, C., 2005. About: Section 3 gives an analytical discussion of the selection criteria mentioned in the title. Later they show how these criteria can be used efficiently in wrapper techniques. A sketch of the greedy selection step follows.
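A hedged sketch of the greedy max-relevance/min-redundancy step, assuming scikit-learn's mutual information estimators; it uses the simple difference form (relevance minus redundancy), and the helper function mrmr as well as the synthetic data are illustrative, not the authors' reference implementation.

```python
# Greedy mRMR-style selection: maximize I(x_i; y) minus mean redundancy.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, k):
    relevance = mutual_info_classif(X, y, random_state=0)      # I(x_i; y)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_f, best_val = None, -np.inf
        for f in range(X.shape[1]):
            if f in selected:
                continue
            # Redundancy: mean mutual information with already selected features.
            red = np.mean([mutual_info_regression(X[:, [f]], X[:, s],
                                                  random_state=0)[0]
                           for s in selected])
            val = relevance[f] - red
            if val > best_val:
                best_f, best_val = f, val
        selected.append(best_f)
    return selected

# Usage (illustrative): pick 5 of 20 synthetic features.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
print(mrmr(X, y, 5))
```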
Embedded
- [WestonEST03]: Use of the Zero Norm with Linear Models and Kernel Methods, Jason Weston, André Elisseeff, Bernhard Schölkopf, Mike Tipping, 2003. About: Prototypical embedded selection where the L_2 regularization in the SVM is (approximately) replaced by L_0, i.e. by counting the non-zero weights (a rough sketch follows this list).
- [PerkinsLT03]: Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space, Simon Perkins, Kevin Lacker, James Theiler, 2003. About: Embedded selection based on the sensitivity of an elaborated cost function to changes in the weights of so far unused features.
- [McCallum03]: Efficiently Inducing Features of Conditional Random Fields. Andrew McCallum, 2003. About: Embedded forward selection based on the conditional likelihood of the CRF-model as trained so far.
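The following is a rough sketch of an embedded, zero-norm-style selection in the spirit of [WestonEST03]: a linear model is retrained on multiplicatively rescaled features so that the weights of irrelevant features shrink toward zero. It assumes scikit-learn; the function name, the rescaling schedule, and the stopping/threshold rules are simplified illustrative choices, not the exact procedure of the paper.

```python
# Zero-norm-style embedded selection via multiplicative feature rescaling.
import numpy as np
from sklearn.svm import LinearSVC

def zero_norm_like_selection(X, y, n_iter=20, tol=1e-6):
    z = np.ones(X.shape[1])                 # per-feature scaling factors
    for _ in range(n_iter):
        clf = LinearSVC(C=1.0, max_iter=5000).fit(X * z, y)
        z = z * np.abs(clf.coef_[0])        # rescale by the learned weights
        if z.max() > 0:
            z = z / z.max()                 # keep the scaling bounded
    return np.flatnonzero(z > tol)          # surviving (selected) features

# Usage (illustrative): 4 informative out of 20 synthetic features.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
print(zero_norm_like_selection(X, y))
```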