
Stack Abuse: Applying Wrapper Methods in Python for Feature Selection

Introduction

In the previous article, we studied how we can use filter methods for feature selection for machine learning algorithms. Filter methods are handy when you want to select a generic set of features for all machine learning models.

However, in some scenarios you may want to use a specific machine learning algorithm to train your model. In such cases, features selected through filter methods may not be the optimal set for that specific algorithm. There is another category of feature selection methods that selects the optimal features for a specified algorithm. Such methods are called wrapper methods.

Wrapper Methods for Feature Selection

Wrapper methods are based on greedy search algorithms: they evaluate many possible combinations of the features and select the combination that produces the best result for a specific machine learning algorithm. A downside to this approach is that testing feature combinations can be computationally very expensive, particularly if the feature set is large; an exhaustive search over n features would have to evaluate 2^n - 1 non-empty subsets, which is over a million subsets for just 20 features.

As said earlier, wrapper methods can find the best set of features for a specific algorithm; the downside is that this set of features may not be optimal for other machine learning algorithms.

Wrapper methods for feature selection can be divided into three categories: step forward feature selection, step backwards feature selection, and exhaustive feature selection. In this article, we will see how we can implement these feature selection approaches in Python.

Step Forward Feature Selection

In the first phase of step forward feature selection, the performance of the classifier is evaluated with respect to each individual feature, and the feature that performs the best is selected.

In the second step, the selected feature is tried in combination with each of the remaining features, and the pair that yields the best algorithm performance is kept. The process continues, adding one feature at a time, until the specified number of features has been selected. A minimal sketch of this greedy loop follows below.
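
To make the procedure concrete, here is a minimal sketch of the greedy loop, assuming a hypothetical evaluate helper that returns a cross-validated score for a model trained on a given list of features (the helper and names are placeholders for illustration, not part of the library used later):

# Minimal sketch of step forward selection. `evaluate` is a hypothetical
# helper that returns a cross-validated score for the given feature list.
def step_forward_selection(all_features, k, evaluate):
    selected = []
    remaining = list(all_features)
    while len(selected) < k and remaining:
        # Score each candidate set formed by adding one remaining feature
        scores = {f: evaluate(selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)  # greedy choice: best single addition
        selected.append(best)
        remaining.remove(best)
    return selected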

Let's implement step forward feature selection in Python. We will be using the BNP Paribas Cardif Claims Management dataset for this section as we did in our previous article.

To implement step forward feature selection, we would need to convert categorical feature values into numeric ones. However, for the sake of simplicity, we will simply remove all the non-numeric (categorical) columns from our data. We will also remove the correlated columns, as we did in the previous article, so that we have a small feature set to process.

Data Preprocessing

The following script imports the dataset and the required libraries, removes the non-numeric columns from the dataset, and then divides the dataset into training and test sets. Finally, all columns with a correlation greater than 0.8 with another column are removed. Take a look at the previous article for a detailed explanation of this script:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold

paribas_data = pd.read_csv(r"E:\Datasets\paribas_data.csv", nrows=20000)
paribas_data.shape

num_colums = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_columns = list(paribas_data.select_dtypes(include=num_colums).columns)
paribas_data = paribas_data[numerical_columns]
paribas_data.shape

train_features, test_features, train_labels, test_labels = train_test_split(
    paribas_data.drop(labels=['target', 'ID'], axis=1),
    paribas_data['target'],
    test_size=0.2,
    random_state=41)

correlated_features = set()
correlation_matrix = paribas_data.corr()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)

train_features.drop(labels=correlated_features, axis=1, inplace=True)
test_features.drop(labels=correlated_features, axis=1, inplace=True)
train_features.shape, test_features.shape

Implementing Step Forward Feature Selection in Python

To select the optimal features, we will be using the SequentialFeatureSelector class from the mlxtend library. The library can be installed by executing the following command at the Anaconda command prompt:

conda install -c conda-forge mlxtend

We will use the RandomForestClassifier to find the optimal features. The evaluation criterion used will be ROC-AUC. The following script selects the 15 features from our dataset that yield the best performance for the random forest classifier:

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlxtend.feature_selection import SequentialFeatureSelector

feature_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
           k_features=15,
           forward=True,
           verbose=2,
           scoring='roc_auc',
           cv=4)

In the script above, we pass the RandomForestClassifier as the estimator to the SequentialFeatureSelector. The k_features parameter specifies the number of features to select; you can set any number here. The forward parameter, if set to True, performs step forward feature selection. The verbose parameter is used for logging the progress of the feature selector, the scoring parameter defines the performance evaluation criterion, and finally, cv refers to the number of cross-validation folds.
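
For comparison, step backwards feature selection, the second category mentioned earlier, starts from the full feature set and greedily removes one feature at a time. With mlxtend, this should only require flipping the forward flag; a sketch under that assumption:

# Step backwards selection: start with all features and drop the least
# useful one at each step until k_features remain.
backward_selector = SequentialFeatureSelector(RandomForestClassifier(n_jobs=-1),
           k_features=15,
           forward=False,  # the only change from the forward version
           verbose=2,
           scoring='roc_auc',
           cv=4)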

We have created our feature selector; now we need to call its fit method and pass it the training features and labels, as shown below:

features = feature_selector.fit(np.array(train_features.fillna(0)), train_labels)

Depending upon your system hardware, the above script can take some time to execute. Once the above script finishes executing, you can execute the following script to see the 15 selected features:

filtered_features = train_features.columns[list(features.k_feature_idx_)]
filtered_features

In the output, you should see the following features:

Index(['v4', 'v10', 'v14', 'v15', 'v18', 'v20', 'v23', 'v34', 'v38', 'v42', 'v50', 'v51', 'v69', 'v72', 'v129'], dtype='object')
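
The fitted selector carries more than the selected indices. If you want to see how the cross-validated ROC-AUC evolved as features were added, mlxtend exposes a k_score_ attribute for the final subset and a subsets_ dictionary with per-step results; a quick way to inspect them (shown as a sketch, with attribute names as documented by mlxtend):

# Cross-validated ROC-AUC of the final 15-feature subset
print(features.k_score_)

# subsets_ maps the number of selected features at each step to the chosen
# feature indices and their cross-validation scores
for n_features, info in features.subsets_.items():
    print(n_features, info['avg_score'])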

Now to see the classification performance of the random forest algorithm using these 15 features, execute the following script:

clf = RandomForestClassifier(n_estimators=100, random_state=41, max_depth=3)
clf.fit(train_features[filtered_features].fillna(0), train_labels)

train_pred = clf.predict_proba(train_features[filtered_features].fillna(0))
print('ROC-AUC on training set: {}'.format(roc_auc_score(train_labels, train_pred[:, 1])))

test_pred = clf.predict_proba(test_features[filtered_features].fillna(0))
print('ROC-AUC on test set: {}'.format(roc_auc_score(test_labels, test_pred[:, 1])))

In the script above, we train our random forest classifier on the 15 features that we selected using step forward feature selection and then evaluate its ROC-AUC on both the training and test sets.
