Feature Selection in Detail

Stojancho Tudjarski
Netcetera Tech Blog
5 min read · May 15, 2020


Feature Selection in Machine Learning

Introduction

Feature selection is the process of selecting a subset of all the features available in the observation data in order to build an optimal Machine Learning model. Well-implemented feature selection leads to faster training and inference, as well as better-performing trained models.

Other benefits of proper feature selection include more straightforward interpretation of the results, reduced overfitting, and reduced feature redundancy.

Feature Selection General Methods

There are three main types of feature selection methods: filter methods, wrapper methods, and embedded methods. Hybrid methods that combine them also exist.

Filter Methods

Filter methods select features independently of the chosen Machine Learning training algorithm, i.e., they are model agnostic. They rely only on the characteristics of the features contained in the data.

The first step is to remove irrelevant features, i.e., constant or quasi-constant features that provide no useful information about the observations.

After removing irrelevant features, the next step is to remove redundant features, i.e., duplicated or highly correlated features. Correlation among features can be measured with Pearson’s correlation coefficient, for example, and a pair of features whose correlation exceeds 0.8 in absolute value is commonly treated as highly correlated.
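
As a rough sketch of these two filter steps, assuming the observations sit in a pandas DataFrame of numeric features (the helper name and the threshold values below are illustrative, not prescribed):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def drop_irrelevant_and_redundant(X: pd.DataFrame,
                                  var_threshold: float = 0.01,
                                  corr_threshold: float = 0.8) -> pd.DataFrame:
    # 1. Remove constant / quasi-constant features (very low variance).
    selector = VarianceThreshold(threshold=var_threshold)
    selector.fit(X)
    X = X.loc[:, selector.get_support()]

    # 2. Remove redundant features: for every pair with |Pearson correlation|
    #    above the threshold, drop the second feature of the pair.
    corr = X.corr().abs()
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > corr_threshold:
                to_drop.add(cols[j])
    return X.drop(columns=list(to_drop))
```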

The final step is to rank the remaining features according to the information they provide about the observations and to choose an arbitrary number of top-ranked features. The critical aspect here is choosing an appropriate ranking algorithm. Possible ones are the Chi-squared test, the Fisher score, univariate analysis, and the univariate ROC-AUC value.
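
A minimal ranking sketch with scikit-learn, assuming a classification problem and using the built-in breast cancer dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Rank features with a univariate test (ANOVA F-score here; chi2 or
# mutual_info_classif are drop-in alternatives) and keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(X, y)
print(X.columns[selector.get_support()].tolist())
```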

Wrapper Methods

The main goal of wrapper methods is to select the subset of the available features that produces the best-performing trained model.

To implement any of the wrapper methods, the machine learning algorithm has to be chosen in advance.

There are three types of wrapper methods, which differ in how they search for the optimal subset of features.

Wrapper Methods | Step Forward Feature Selection

Let’s say that N is the number of available features. The step forward wrapper method works in the following way (a scikit-learn sketch follows the list):

  1. Evaluate all subsets that contain a single feature
  2. Choose the one that produces the best-performing trained model
  3. Evaluate the models trained on all two-feature subsets that include the already selected feature
  4. Choose the best-performing two-feature subset
  5. Repeat, adding one feature at a time, until the subset contains all the features
  6. From the resulting N subsets, choose the one whose trained model performs best
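
A sketch of step forward selection with scikit-learn’s SequentialFeatureSelector (the dataset, estimator, and subset size are illustrative; note that this selector stops at a requested number of features rather than scanning every subset size as in the list above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Greedily add one feature at a time, keeping the addition that gives
# the best cross-validated score, until 10 features are selected.
sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=10,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)
print(X.columns[sfs.get_support()].tolist())
```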

Wrapper Methods | Step Backward Feature Selection

This method mirrors step forward feature selection but works in the opposite direction (a scikit-learn sketch follows the list):

  1. The initial feature set is the one that contains all N available features
  2. Generate N subsets by removing each of the features in turn from the current set
  3. Evaluate the performance of the models trained on these subsets with N-1 features
  4. Choose the best-performing subset as the new current set
  5. Repeat the removal step until only one feature remains
  6. From the resulting N subsets (one per size), choose the one whose trained model performs best.
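
The backward variant is the same call with the direction flipped; again the dataset, estimator, and subset size are only illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Start from all features and greedily remove the least useful one
# at each step until 10 features remain.
sbs = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=10,
    direction="backward",
    cv=5,
)
sbs.fit(X, y)
print(X.columns[sbs.get_support()].tolist())
```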

Wrapper Methods | Exhaustive Feature Selection

This method trains models with all the possible subsets from the set of all available features. The subset with the best performance is the chosen one.
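
Exhaustive search is simple enough to sketch by hand with itertools; here it is limited to six candidate features of the wine dataset so that the demo stays cheap (63 subsets):

```python
from itertools import combinations

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True, as_frame=True)
features = list(X.columns)[:6]  # keep the search space small for the demo

best_score, best_subset = -1.0, None
# Train and cross-validate a model on every non-empty subset of features.
for size in range(1, len(features) + 1):
    for subset in combinations(features, size):
        score = cross_val_score(
            LogisticRegression(max_iter=5000), X[list(subset)], y, cv=5
        ).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```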

Wrapper Methods | Note

Let’s say that N = 20, i.e., there are 20 available features. Step forward and step backward selection train on the order of N² models (at most 20 · 21 / 2 = 210 subsets for N = 20), while exhaustive feature selection trains 2²⁰ − 1 ≈ 1,000,000 models. In all real-world scenarios, the number of subsets of a set with N elements is enormous, so exhaustive selection is rarely used in practice.

Embedded Methods | Introduction

Embedded methods perform feature selection during the model training process. These methods are embedded in the training algorithm as its primary or extended functionality.

The advantages of these methods: they are faster than wrapper methods, more accurate than filter methods, and they detect interactions between features.

Procedure:

  1. Train a machine learning algorithm
  2. Derive the feature importance
  3. Remove unimportant features

There are three common groups of embedded methods for feature selection: Lasso regularization, regression models, and tree-based models.

Embedded Methods | Lasso Regularization

Lasso regularization is L1 regularization. It adds a penalty term to the cost function that the training algorithm minimizes: the sum of the absolute values of the model weights, multiplied by a regularization parameter λ. For small values of λ, the weight coefficients stay different from zero. As λ increases, some weight coefficients shrink to exactly 0, meaning that the corresponding features are not crucial for the model. By gradually increasing λ and retraining the model, more and more weights become 0. The process stops at an arbitrarily chosen maximum value of λ, or when the desired number of features can be removed from the observation set.
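
A sketch of Lasso-style selection with scikit-learn, where SelectFromModel keeps the features whose L1-penalized coefficients stay non-zero (the value of C is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_scaled = StandardScaler().fit_transform(X)

# In scikit-learn, C is the inverse of the regularization strength, so a
# smaller C plays the role of a larger λ and zeroes out more coefficients.
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
selector = SelectFromModel(l1_model)
selector.fit(X_scaled, y)
print(X.columns[selector.get_support()].tolist())
```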

Embedded Methods | Regression Models

Feature selection using linear models assumes a linear (multivariate) dependency of the target on the available features and that the feature values are normally distributed. In that case, a model is trained with logistic regression for classification or linear regression for regression. The training produces a weight for each feature involved; more important features get larger absolute weights, and vice versa. The selected set is an arbitrary number of top-ranked features, with the weights as the ranking criterion. Note that the features must be on the same scale for the weights to be comparable.
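
A sketch of coefficient-based ranking, again on the breast cancer dataset for illustration:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Scale the features so that coefficient magnitudes are comparable,
# then rank them by the absolute value of their weights.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=5000).fit(X_scaled, y)

ranking = pd.Series(abs(model.coef_[0]), index=X.columns)
print(ranking.sort_values(ascending=False).head(10))
```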

Embedded Methods | Trees

A decision tree is a machine learning algorithm for both classification and regression, in which the importance of each feature can be measured by how much the splits on that feature increase the purity of the resulting buckets of observations. The splitting criterion can be information gain, the Gini index, or entropy, among others. The more a feature decreases impurity, the more relevant it is.

A variant of this method is the random forest. A random forest is a set of decision trees (usually hundreds), each built on a random subset of the observations and of the available features. In this case, each feature’s importance is calculated as its average importance across all the trees in the forest.
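
A sketch of tree-based importance with a random forest (dataset and hyperparameters are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Impurity-based importance of each feature, averaged over all trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```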

Another variant is the recursive use of random forests. After the first run, a few features are removed as insignificant. The forest is then retrained, and a second batch of insignificant features is removed. The procedure is repeated until a chosen model-performance criterion is met.
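
scikit-learn’s RFE (recursive feature elimination) implements a closely related idea and can be wrapped around a random forest; the number of features kept and removed per iteration below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Repeatedly fit the forest, drop the least important features,
# and refit, until 10 features remain.
rfe = RFE(
    RandomForestClassifier(n_estimators=100, random_state=0),
    n_features_to_select=10,
    step=2,  # features removed at each iteration
)
rfe.fit(X, y)
print(X.columns[rfe.support_].tolist())
```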

The same feature selection methodology can be used with gradient boosting trees instead of random forests.

Conclusion

That should be all for feature selection.

https://www.udemy.com/course/feature-selection-for-machine-learning/ extensively covers all the mentioned topics, and it is highly recommended for all interested in implementing the full Machine Learning pipeline.

What Comes Next in Machine Learning Pipeline

The next step in the Machine Learning pipeline is Deployment of Machine Learning Models, and this topic will be covered in another post that follows.

