Reading Group Feature Selection - SS 07
- target group: (PhD-)students and staff with some ML background
- time and place: Wednesday, 11:30-12:30 at E302 or a nearby seminar room
- contact: weizsaec@rbg.informatik.tu-darmstadt.de
Introduction
Feature selection is the task of choosing a subset S of input variables/features that yields high performance when used by some learning algorithm. This explanation is somewhat vague in at least two respects:
- Representation of the superset: Let V be the set that contains all variables considered. Is V given explicitly, so that we can search it by looking at each member in turn, or is it only an abstract set from which we can generate members but where exhaustive search is infeasible?
- Division of feature selection and learning: Don't the selection of useful variables and the learning of weights for a model using these variables belong together?
We want to study feature selection techniques, of which a survey is given in
- [GuyonE05]: An Introduction to Variable and Feature Selection, Isabelle Guyon, André Elisseeff, JMLR Special Issue on Variable and Feature Selection (Kernel Machines Section), 2003.
Further, we want to contrast these techniques with those used in symbolic rule learning algorithms, as rules can also be taken as binary variables.
- [BlumL97]: Selection of Relevant Features and Examples in Machine Learning, Avrim L. Blum, Pat Langley, 1997.
Schedule
Date | Reading | Presenter |
---|---|---|
16.5.2007 | Intro | |
23.5.2007 | Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy | Sang-Hyeun |
30.5.2007 | Searching for Interacting Features | Nik |
6.6.2007 | Distributional Word Clusters vs. Words for Text Categorization | Eneldo |
13.6.2007 | Ranking a Random Feature for Variable and Feature Selection | Frederik |
20.6.2007 | Semi-supervised Feature Selection via Spectral Analysis | Lorenz |
Systematics following [GuyonE05]
For a first overview we distinguish the following families of feature selection approaches, following the survey paper [GuyonE05] (a minimal sketch contrasting the three families is given after the list):
- filters - selection schemes that do not depend on nor use any predictor
- wrappers - approximate optimization of feature subsets w.r.t. the cost function of the predictor
- embedded - integration of feature selection criteria into the cost function of the learner
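To make the distinction concrete, here is a minimal sketch assuming a scikit-learn environment; the synthetic dataset, the mutual-information criterion for the filter, the greedy forward search for the wrapper, and the L1-regularized model for the embedded case are illustrative choices on my part, not prescriptions from [GuyonE05].

```python
# Minimal sketch of the three approach families (illustrative parameters only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif   # filter criterion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Filter: rank features by a criterion that ignores the predictor.
mi = mutual_info_classif(X, y, random_state=0)
filter_top5 = np.argsort(mi)[::-1][:5]

# Wrapper: greedy forward selection scored by the predictor's CV accuracy.
selected = []
for _ in range(5):
    best_f, best_score = None, -np.inf
    for f in range(X.shape[1]):
        if f in selected:
            continue
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, selected + [f]], y, cv=3).mean()
        if score > best_score:
            best_f, best_score = f, score
    selected.append(best_f)

# Embedded: the learner's own (L1-regularized) cost function drives selection.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
embedded = np.flatnonzero(l1_model.coef_[0])

print(sorted(filter_top5), sorted(selected), sorted(embedded))
```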
Reading Proposals
- Further proposals are welcome.
Filters
- [DhillonMK03]: A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification, Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar, 2003. About: Feature creation by distributional clustering based on information-theoretic relations, applied to (hierarchical) text classification. See [GuyonE05], section 5.1; see also Bekkerman et al. 2003.
- [StoppigliaDDO03]: Ranking a Random Feature for Variable and Feature Selection, Hervé Stoppiglia, Gérard Dreyfus, Rémi Dubois, Yacine Oussar, 2003. About: Combines feature ranking via Gram-Schmidt orthogonalization in example space with a random probe feature in order to get the right cut-off (a simplified sketch of the probe idea follows this list). See [GuyonE05], section 4.2.
- [ZhaoL07]: Searching for Interacting Features, Z. Zhao and H. Liu, 2007. About: Finds subsets of features that are jointly relevant.
- [ZhaoL07b]: Semi-supervised Feature Selection via Spectral Analysis, Z. Zhao and H. Liu, 2007. About: Graph-theoretic ranking of features. If the image of a point in example space w.r.t. the root of the Laplacian of the similarity matrix has a large L_2 norm, the corresponding feature is considered to fit the given similarities less well and is therefore less relevant.
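As a rough illustration of the random-probe idea from [StoppigliaDDO03], the following sketch ranks features by squared correlation with the target instead of the paper's full Gram-Schmidt orthogonalization, and keeps only those features that score better than most purely random probe features. The function name, the probe quantile, and the synthetic data are my own illustrative choices.

```python
# Simplified random-probe filter: keep features that beat random probes.
import numpy as np

def probe_filter(X, y, n_probes=100, quantile=0.95, rng=None):
    """Keep features whose ranking score beats most random probe features."""
    rng = np.random.default_rng(rng)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Squared correlation with the target as the ranking criterion.
    score = (Xc.T @ yc) ** 2 / (np.sum(Xc ** 2, axis=0) * np.sum(yc ** 2) + 1e-12)
    # Score purely random probe features in exactly the same way.
    P = rng.standard_normal((X.shape[0], n_probes))
    Pc = P - P.mean(axis=0)
    probe = (Pc.T @ yc) ** 2 / (np.sum(Pc ** 2, axis=0) * np.sum(yc ** 2) + 1e-12)
    # Keep features that look more relevant than e.g. 95% of the probes.
    return np.flatnonzero(score > np.quantile(probe, quantile))

# Example: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(300)
print(probe_filter(X, y, rng=1))   # typically -> [0 1]
```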
Filters/Wrappers
- [PengLD05]: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, Peng, H.C., Long, F., and Ding, C., 2005. About: Section 3 gives an analytical discussion of the selection criteria mentioned in the title. Later they show how these criteria can be used efficiently in wrapper techniques. A sketch of the greedy selection step follows.
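A hedged sketch of the greedy max-relevance/min-redundancy step, assuming scikit-learn's mutual information estimators; it uses the simple difference form (relevance minus redundancy), and the helper function mrmr as well as the synthetic data are illustrative, not the authors' reference implementation.

```python
# Greedy mRMR-style selection: maximize I(x_i; y) minus mean redundancy.
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, k):
    relevance = mutual_info_classif(X, y, random_state=0)      # I(x_i; y)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_f, best_val = None, -np.inf
        for f in range(X.shape[1]):
            if f in selected:
                continue
            # Redundancy: mean mutual information with already selected features.
            red = np.mean([mutual_info_regression(X[:, [f]], X[:, s],
                                                  random_state=0)[0]
                           for s in selected])
            val = relevance[f] - red
            if val > best_val:
                best_f, best_val = f, val
        selected.append(best_f)
    return selected

# Usage (illustrative): pick 5 of 20 synthetic features.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
print(mrmr(X, y, 5))
```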
Embedded
- [WestonEST03]: Use of the Zero Norm with Linear Models and Kernel Methods, Jason Weston, André Elisseeff, Bernhard Schölkopf, Mike Tipping, 2003. About: Prototypical embedded selection where the L_2 regularization in the SVM is (approximately) replaced by L_0, i.e. by counting the non-zero weights (a rough sketch follows this list).
- [PerkinsLT03]: Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space, Simon Perkins, Kevin Lacker, James Theiler, 2003. About: Embedded selection based on the sensitivity of an elaborated cost function to changes in the weights of so far unused features.
- [McCallum03]: Efficiently Inducing Features of Conditional Random Fields. Andrew McCallum, 2003. About: Embedded forward selection based on the conditional likelihood of the CRF-model as trained so far.
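The following is a rough sketch of an embedded, zero-norm-style selection in the spirit of [WestonEST03]: a linear model is retrained on multiplicatively rescaled features so that the weights of irrelevant features shrink toward zero. It assumes scikit-learn; the function name, the rescaling schedule, and the stopping/threshold rules are simplified illustrative choices, not the exact procedure of the paper.

```python
# Zero-norm-style embedded selection via multiplicative feature rescaling.
import numpy as np
from sklearn.svm import LinearSVC

def zero_norm_like_selection(X, y, n_iter=20, tol=1e-6):
    z = np.ones(X.shape[1])                 # per-feature scaling factors
    for _ in range(n_iter):
        clf = LinearSVC(C=1.0, max_iter=5000).fit(X * z, y)
        z = z * np.abs(clf.coef_[0])        # rescale by the learned weights
        if z.max() > 0:
            z = z / z.max()                 # keep the scaling bounded
    return np.flatnonzero(z > tol)          # surviving (selected) features

# Usage (illustrative): 4 informative out of 20 synthetic features.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
print(zero_norm_like_selection(X, y))
```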