
Using 3D visualizations to tune hyperparameters of ML models with Python


The majority of this post consists of interactive visualizations you can hover over, zoom and move around. It’s better read on a computer than on your phone, but landscape mode on the phone will at least let you see the plots better than portrait mode.

Imagine that you’re trying to develop a solution to Kaggle’s Rossmann Store Sales competition. You’ve done a lot of feature engineering and created a ton of new variables that may help you predict future sales better.

You’ve created a Random Forest and you’re trying to find its optimal hyperparameters. There are 1000+ possible combinations of them that you want to evaluate. You can run a randomized search to analyze only a subsample of them, or a grid search to explore the full grid of parameters.

Some of the parameters are evenly spaced on a log scale. You can do that with np.logspace.
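For instance, a log-spaced grid of candidate values for max_features could look like this (the bounds below are illustrative, not the ones actually used in the post):

```python
import numpy as np

# 5 candidate values for max_features, evenly spaced on a log scale
# between 10**-1 = 0.1 and 10**-0.5 ≈ 0.32 (illustrative bounds)
max_features_grid = np.logspace(-1, -0.5, num=5)
print(max_features_grid)
```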

You do the latter and now have some data. With rf_gridsearch.best_params_ you can get the 3 parameters that yield the best results on the test set (max_features: 0.25, min_samples_split: 13, n_estimators: 45). But what if you want to visualize the performance of all of the random forests that were trained, across the 3 dimensions?
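A rough sketch of how such a grid search might be wired up with scikit-learn's GridSearchCV (the data and grid values below are toy stand-ins, not the Rossmann features or the grid from the post):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Tiny synthetic dataset standing in for the engineered Rossmann features
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ rng.rand(5) + rng.normal(scale=0.1, size=100)

# Illustrative grid over the three hyperparameters discussed below
param_grid = {
    "max_features": [0.25, 0.5, 1.0],
    "min_samples_split": [2, 13],
    "n_estimators": [10, 45],
}
rf_gridsearch = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
    return_train_score=True,  # keep train scores to compare against test scores
)
rf_gridsearch.fit(X, y)
print(rf_gridsearch.best_params_)
```

After fitting, rf_gridsearch.cv_results_ holds one row per hyperparameter combination, which is exactly the data the visualizations below are built from.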

Wait, WTF do all those buzzwords mean?

A Decision Tree is a type of supervised machine learning algorithm that, given a dataset, recursively divides it into subsets whose target values are more similar to each other. Given new data with the same kinds of independent variables it was trained on, it can predict the dependent variable.


[Image: A decision tree used for regression. Credit: this great post from UC Berkeley]

A Random Forest (RF from now on) is a collection of decision trees, each trained on a subset of the full training data and using a subset of the features. This makes the individual decision trees less correlated, so the ensemble generalizes better and overfits less. RFs are faster to train than neural networks and are a pretty good, fast first attempt at solving classification and regression problems on structured data.

There are several hyperparameters that we can set in RFs. Read about them all in scikit-learn’s documentation. Some of the most important ones are:

n_estimators: number of trees in the RF.
min_samples_split: minimum number of samples in a subset (aka node) required to split it into two more subsets. Related to min_samples_leaf and to max_depth.
max_features: maximum number of features (independent variables) to consider when splitting a node.

The complexity of the RF increases with higher n_estimators, max_features and lower min_samples_split.

Cross validation is a technique used to find the optimal hyperparameters of a machine learning model. To perform it, we divide the data into 3 subsets: a train set (used to train the model), a validation set (used to optimize the hyperparameters) and a test set (used to check the performance of the model at the end, as if we were already in production). We evaluate the performance of the model using some score that varies depending on the type of problem we’re trying to solve (regression, classification, clustering…).
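A minimal sketch of that 3-way split using scikit-learn's train_test_split (the 60/20/20 proportions and the placeholder data are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)   # placeholder features
y = rng.rand(1000)      # placeholder target

# First carve out the final test set (20% of the data)...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then split the remainder into train (60% of the total) and
# validation (20% of the total): 0.25 of the remaining 80% is 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
```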


[Image: diagram of a train/validation/test split. A pic is always worth 1000 words!]

For regression, R2 (R-squared) works well and is the one I used. Generally speaking, the more complex the model, the better the score on the train set. On the test set, the score also increases with model complexity at first, but after a certain point it stops increasing and may even decrease. What we do with cross validation is try to find that point.
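For reference, scikit-learn exposes this score as sklearn.metrics.r2_score (the numbers below are made up for illustration):

```python
from sklearn.metrics import r2_score

# R2 = 1 means a perfect fit; R2 = 0 means no better than
# always predicting the mean of y_true.
y_true = [3.0, 2.5, 4.0, 7.0]
y_pred = [2.8, 2.7, 4.2, 6.5]
score = r2_score(y_true, y_pred)
print(score)  # close to 1, i.e. a good fit
```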

OK, let’s get going :fire:

In our case, the model complexity is a function of the 3 hyperparameters. What we’ll do next is try to visualize where that best model sits in the 3D grid of hyperparameters and how all of the combinations performed. Note that in the plots I used the words test set and test score, but what I’m referring to is validation.

First: 2D Heatmaps

As a quick approach, we could plot heatmaps with the possible pairs of different parameters to see the areas where the maximum R2 is achieved in the test and the train sets. This can be done in a couple of lines:
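A sketch of what those couple of lines might look like, using a toy stand-in for pd.DataFrame(rf_gridsearch.cv_results_) (the column names mirror scikit-learn's cv_results_ keys; the scores are made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the grid-search results table
results = pd.DataFrame({
    "param_max_features": [0.25, 0.25, 0.5, 0.5],
    "param_min_samples_split": [2, 13, 2, 13],
    "mean_test_score": [0.91, 0.89, 0.88, 0.85],
})

# Average over any remaining hyperparameters, then pivot into a 2D grid
heat = results.pivot_table(index="param_min_samples_split",
                           columns="param_max_features",
                           values="mean_test_score")

fig, ax = plt.subplots()
im = ax.imshow(heat.values, cmap="viridis")
ax.set_xticks(range(len(heat.columns)))
ax.set_xticklabels(heat.columns)
ax.set_yticks(range(len(heat.index)))
ax.set_yticklabels(heat.index)
ax.set_xlabel("max_features")
ax.set_ylabel("min_samples_split")
fig.colorbar(im, ax=ax, label="mean test R2")
```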


[Image: heatmaps of R2 scores for the test sets]
[Image: heatmaps of R2 scores for the training sets]

But this way we still only see a small fraction of the results. To see all of them (and more clearly than with heatmaps) we need to create 3D visualizations.

Scatter3D

With Plotly we can create nice and interactive visualizations of all kinds. First, let’s create a 3D scatter plot where the size of the points is proportional to the training time and the color is proportional to the R2 score in the test set.

See this in fullscreen mode here

Nice, but we still can’t see much. We can say that the bigger max_features and the smaller min_samples_split, the greater the test score, but it’s difficult to hover over points in the middle of the 3D scatter plot.

The code to make this is too long for Medium, but you can see it all here.
