
Feature Selection

What is Feature Selection?


Feature selection is one of the main components of feature engineering: it is the process of reducing the number of input variables used to develop a predictive model.

Feature selection techniques reduce the number of input variables by eliminating redundant or irrelevant features, narrowing the feature set to those most relevant to the machine learning model. The objective of feature selection in machine learning is to identify the most useful group of features for building effective models of the phenomenon being studied.

Examples of Feature Selection

Feature selection is an efficient preprocessing technique for various real-world applications, such as text categorization, remote sensing, image retrieval, microarray analysis, mass spectrum analysis, and sequence analysis.

Below are some real-life examples of feature selection:

  1. Mammographic image analysis
  2. Criminal behavior modeling
  3. Genomic data analysis
  4. Plant monitoring
  5. Mechanical integrity assessment
  6. Text clustering
  7. Hyperspectral image classification
  8. Sequence analysis

Why is Feature Selection important?

Feature selection makes the machine learning process more accurate and increases the predictive power of algorithms by retaining the most critical variables and eliminating redundant and irrelevant ones.

Three key benefits of feature selection are:

  1. Decreases over-fitting  
    Fewer redundant data means fewer chances of making decisions based on noise.
  2. Improves Accuracy  
    Less misleading data means better modeling accuracy.
  3. Reduces Training Time  
    Less data means quicker algorithms.

What are the three types of Feature Selection methods?

Feature selection techniques can first be divided into supervised and unsupervised approaches. Supervised methods fall into three main types: wrapper methods (forward, backward, and stepwise selection), filter methods (ANOVA, Pearson correlation, variance thresholding), and embedded methods (Lasso, Ridge, decision trees).

Wrapper methods

Wrapper methods train a model on a subset of features and, based on the conclusions drawn from that model, decide whether to add features to or remove features from the subset. This reduces feature selection to a search problem, which usually carries a high computational cost.

Standard wrapper methods include forward feature selection, backward feature elimination, and recursive feature elimination.

  • Forward Selection
    In forward selection, start with no features in the model and iterate forward: on each repetition, add the feature that improves the model the most, until adding a new variable no longer improves performance.
  • Backward Elimination
    Backward elimination starts with all the features and, at each iteration, removes the least significant feature if doing so improves the model's performance. The process repeats until no further improvement is observed.
  • Recursive Feature Elimination
    This algorithm aims to find the best-performing subset of features via greedy optimization. Each iteration builds a model and sets aside the best- or worst-performing feature; the next model is built on the remaining features until all features are exhausted. The features are then ranked according to their elimination order.
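The wrapper approaches above can be sketched with scikit-learn's `SequentialFeatureSelector`, which adds (or, with `direction="backward"`, removes) one feature per round based on cross-validated score. This is a minimal sketch, assuming scikit-learn is installed; the wine dataset and logistic regression estimator are illustrative choices, not part of the original text.

```python
# Minimal sketch of forward feature selection (a wrapper method) with scikit-learn.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scaling helps the solver converge

# Start with no features; each round, add the one that most improves CV accuracy.
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="forward",   # "backward" gives backward elimination instead
)
sfs.fit(X, y)
print(sfs.get_support())   # boolean mask over the 13 original wine features
```

Because each round refits the estimator once per remaining candidate feature, the cost grows quickly with the number of features, which is the "high computational cost" noted above.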

A popular companion to wrapper methods is the Boruta package, which assesses the importance of each feature by comparing it against randomly shuffled "shadow" copies of the features.

Filter methods

Filter methods are generally used as preprocessing steps, and their selection is independent of any machine learning algorithm. Instead, features are selected based on their scores in various statistical tests of their relationship with the outcome variable.

  • Pearson’s Correlation
    It is used to quantify linear dependence between two continuous variables, X and Y. Its value ranges from -1 to 1.
  • LDA
    Linear discriminant analysis finds a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable.
  • ANOVA
    ANOVA stands for analysis of variance. It operates similarly to LDA, except that it uses one or more categorical independent features and one continuous dependent feature. The test determines whether the means of several groups are equal.
  • Chi-Square
    This is a statistical test used to determine whether there is a correlation between categorical features based on their frequency distributions.
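A filter method scores each feature independently of any downstream model and keeps the top scorers. This is a minimal sketch, assuming scikit-learn is installed, using the ANOVA F-test described above; the wine dataset and `k=5` are illustrative assumptions.

```python
# Minimal sketch of a filter method: score features with the ANOVA F-test,
# keep the k highest-scoring ones, discard the rest.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features

selector = SelectKBest(score_func=f_classif, k=5)  # chi2 would require non-negative features
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # only the 5 highest-scoring original features remain
```

Since no model is trained, this runs in a single pass and is far cheaper than a wrapper method, at the cost of ignoring interactions between features.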

It is important to remember that filter methods do not remove multicollinearity, because each feature is scored individually. You should therefore address multicollinearity among features before training models on your data.

Embedded methods

Embedded methods combine the qualities of filter and wrapper methods by using algorithms that have feature selection built into the training process itself.

For example, Ridge and Lasso regression both have built-in penalization functions that reduce overfitting.

  • The L1 regularization of Lasso regression adds a penalty equal to the absolute value of the coefficients' magnitude.
  • Ridge regression performs L2 regularization, which imposes a penalty equal to the square of the coefficients' magnitude.
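Because the L1 penalty can drive coefficients to exactly zero, a fitted Lasso model performs feature selection as a side effect of training. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the synthetic data (two truly informative features out of eight) and `alpha=0.5` are illustrative assumptions.

```python
# Minimal sketch of an embedded method: Lasso's L1 penalty zeroes out the
# coefficients of uninformative features, so the model selects features itself.
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only features 0 and 3 actually influence the target.
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
mask = selector.get_support()
print(mask)  # True only where the coefficient survived the L1 penalty
```

Note that Ridge's L2 penalty shrinks coefficients toward zero without making them exactly zero, so on its own it regularizes rather than selects; L1-based models are the usual choice when a sparse feature set is the goal.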

Feature Selection vs. Other Technologies & Methodologies

Feature selection vs. extraction

The key difference between feature selection and extraction is that feature selection keeps a subset of the original features while feature extraction algorithms transform the data onto a new feature space. Some supervised algorithms already have built-in feature selection, such as Regularized Regression and Random Forests.

In summary:

Extraction: Getting useful features from existing data.

Selection: Choosing a subset of the original pool of features.

Feature selection vs. dimensionality reduction

Feature selection simply selects or excludes given features without modifying them in any way, while dimensionality reduction transforms features into a lower-dimensional space. The feature set produced by feature selection must be a subset of the original features; the set produced by dimensionality reduction need not be (for example, PCA reduces dimensionality by creating new synthetic features as linear combinations of the originals, then discarding the less important ones). In this sense, feature selection is a special case of dimensionality reduction.
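The contrast can be made concrete: a selector returns a subset of the original columns unchanged, while PCA returns new columns that are linear combinations of all the originals. This is a minimal sketch, assuming scikit-learn is installed; the wine dataset and the choice of two components/features are illustrative.

```python
# Selection keeps original columns; PCA (dimensionality reduction) makes new ones.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_wine(return_X_y=True)  # 178 samples, 13 features

X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)  # 2 original columns, untouched
X_pca = PCA(n_components=2).fit_transform(X)             # 2 synthetic linear combinations

print(X_sel.shape, X_pca.shape)  # same shape, but only X_sel's columns exist in X
```

Every column of `X_sel` appears verbatim in `X`, whereas no column of `X_pca` generally does, which is exactly the subset-versus-transformation distinction drawn above.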