
Stacked Ensembles and Word2Vec now available in H2O!

[Category: Development (Python) | Posted by: 店小二03 | Date: 2017 | Author: 红领巾]
Stacked Ensembles

H2O’s new Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking or “Super Learning.” This method currently supports regression and binary classification; multiclass support is planned for a future release. A full list of the planned features for Stacked Ensemble can be viewed here.

H2O previously supported the creation of ensembles of H2O models through a separate implementation, the h2oEnsemble R package, which is still available and will continue to be maintained. For new projects, however, we’d recommend using the native H2O version. Native support for stacking in the H2O backend brings ensembles to all the H2O APIs.

Creating ensembles of H2O models is now dead simple. You pass a list of existing H2O model ids to the stacked ensemble function and you are ready to go. This list of models can be a set of manually created H2O models, a random grid of models (of GBMs, for example), or a set of grids of different algorithms. Typically, the more diverse the collection of base models, the better the ensemble performance. Thus, using H2O’s Random Grid Search is a handy way of quickly generating a set of base models for the ensemble.

R:

ensemble <- h2o.stackedEnsemble(x = x, y = y, training_frame = train, base_models = my_models)

Python:

from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
ensemble = H2OStackedEnsembleEstimator(base_models=my_models)
ensemble.train(x=x, y=y, training_frame=train)
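
To make the idea of stacking concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not H2O's implementation: the two toy base learners, the modulo fold assignment, the intercept-free least-squares metalearner, and all function names are hypothetical choices made for this example. It shows the core step of Super Learning: each base learner produces out-of-fold (cross-validated) predictions, and a metalearner is fit on those predictions to find the combination weights.

```python
# Toy stacking sketch in plain Python; an illustration, not H2O's implementation.

def fit_mean(xs, ys):
    """Base learner 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Base learner 2: ordinary least squares on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_predictions(fit, xs, ys, nfolds):
    """Out-of-fold predictions: each row is predicted by a model that never saw it."""
    preds = [0.0] * len(xs)
    for fold in range(nfolds):
        train_idx = [i for i in range(len(xs)) if i % nfolds != fold]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in range(fold, len(xs), nfolds):
            preds[i] = model(xs[i])
    return preds

def fit_metalearner(z1, z2, ys):
    """Least-squares weights (no intercept) for two columns of CV predictions."""
    a = sum(p * p for p in z1)
    b = sum(p * q for p, q in zip(z1, z2))
    d = sum(q * q for q in z2)
    e = sum(p * y for p, y in zip(z1, ys))
    f = sum(q * y for q, y in zip(z2, ys))
    det = a * d - b * b
    return (e * d - b * f) / det, (a * f - b * e) / det

def sse(preds, ys):
    """Sum of squared errors."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys))

# Hypothetical toy data: roughly y = 2x + 1 with a small deterministic wobble.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + 0.3 * ((i * 3) % 7 - 3) for i, x in enumerate(xs)]

# Level-one data: one column of out-of-fold predictions per base learner.
z1 = cv_predictions(fit_mean, xs, ys, nfolds=4)
z2 = cv_predictions(fit_line, xs, ys, nfolds=4)
w1, w2 = fit_metalearner(z1, z2, ys)

# Final ensemble: base learners refit on all rows, combined with the learned weights.
mean_model, line_model = fit_mean(xs, ys), fit_line(xs, ys)

def ensemble_predict(x):
    return w1 * mean_model(x) + w2 * line_model(x)

stacked = [w1 * p + w2 * q for p, q in zip(z1, z2)]
print(sse(z1, ys), sse(z2, ys), sse(stacked, ys))
```

Because the metalearner minimizes squared error over all weighted combinations, its error on the level-one data cannot exceed that of either base learner alone. With the real API, the cross-validated predictions come from the base models themselves, so they need to be trained with cross-validation and to keep their cross-validation predictions.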

Full R and Python code examples are available on the Stacked Ensembles docs page. Kagglers rejoice!

Stacking in native h2o now?? Can't... stop... kaggling... @h2oai #rstats #kaggle

― James (@Blair09M) February 7, 2017

Word2Vec

H2O now has a full implementation of Word2Vec. Word2Vec is a group of related models that are used to produce word embeddings (a language modeling/feature engineering technique in natural language processing where words or phrases are mapped to vectors of real numbers). The word embeddings can subsequently be used in a machine learning model, for example, GBM. This allows users to work with text-based data in current H2O algorithms very efficiently. An R example is available here.
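
One common way to turn word embeddings into model features is to average the vectors of the words in each document, giving one fixed-length row per document that a model such as GBM can consume. The sketch below assumes that approach; the toy two-dimensional vectors and the function name `average_embedding` are hypothetical and are not part of the H2O API.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of the known tokens; all zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical toy embeddings; a trained Word2Vec model would supply real ones.
toy_vectors = {
    "good": [1.0, 0.0],
    "bad": [-1.0, 0.0],
    "movie": [0.0, 1.0],
}

features = average_embedding(["good", "movie"], toy_vectors, dim=2)
print(features)
```

Words missing from the vocabulary are skipped here, which is one of several reasonable policies.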

Technical Details

H2O’s Word2Vec is based on the skip-gram model. The training objective of skip-gram is to learn word vector representations that are good at predicting a word's context in the same sentence. Mathematically, given a sequence of training words $w_1, w_2, \dots, w_T$, the objective of the skip-gram model is to maximize the average log-likelihood

$$\frac{1}{T} \sum_{t = 1}^{T}\sum_{j=-k}^{j=k} \log p(w_{t+j} | w_t)$$

where $k$ is the size of the training window.
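
The objective can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the `prob` argument stands in for whatever model supplies $p(w_{t+j} | w_t)$, and the sketch assumes the usual convention of skipping $j = 0$ and any offsets that fall outside the sequence.

```python
import math

def skipgram_pairs(words, k):
    """Yield (center, context) pairs for offsets j in [-k, k], j != 0, inside the sequence."""
    for t, center in enumerate(words):
        for j in range(-k, k + 1):
            if j != 0 and 0 <= t + j < len(words):
                yield center, words[t + j]

def average_log_likelihood(words, k, prob):
    """(1/T) * sum over t and j of log p(w_{t+j} | w_t), where prob(context, center) gives p."""
    total = sum(math.log(prob(context, center))
                for center, context in skipgram_pairs(words, k))
    return total / len(words)

sentence = ["the", "quick", "brown", "fox"]
pairs = list(skipgram_pairs(sentence, k=1))

# A placeholder model that assigns every word in a 4-word vocabulary equal probability.
uniform = lambda context, center: 0.25
objective = average_log_likelihood(sentence, k=1, prob=uniform)
print(pairs, objective)
```

Training adjusts the word vectors so that this average rises above what a uniform guess achieves.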

In the skip-gram model, every word $w$ is associated with two vectors $u_w$ and $v_w$, which are the vector representations of $w$ as a word and as a context, respectively. The probability of correctly predicting word $w_i$ given word $w_j$ is determined by the softmax model, which is

$$p(w_i | w_j ) = \frac{\exp(u_{w_i}^{\top}v_{w_j})}{\sum_{l=1}^{V} \exp(u_l^{\top}v_{w_j})}$$

where $V$ is the vocabulary size.
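
As a small worked example of this formula, the following sketch computes $p(w_i | w_j)$ for a three-word vocabulary. The vectors are made-up toy values and the function names are hypothetical; subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the result unchanged.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_prob(i, j, u, v):
    """p(w_i | w_j) = exp(u_i . v_j) / sum_l exp(u_l . v_j)."""
    scores = [dot(u_l, v[j]) for u_l in u]   # one dot product per vocabulary word
    top = max(scores)                        # shift scores for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[i] / sum(exps)

# Hypothetical toy vectors for a vocabulary of V = 3 words.
u = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]]   # word vectors u_w
v = [[0.5, 0.5], [-0.2, 0.3], [0.1, -0.4]]   # context vectors v_w

probs = [softmax_prob(i, 0, u, v) for i in range(len(u))]
print(probs)
```

Note that every call loops over the whole vocabulary to build the denominator, which is the cost the next paragraph addresses.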

The skip-gram model with softmax is expensive because the cost of computing $\log p(w_i | w_j)$ is proportional to $V$, which can easily be on the order of millions. To speed up Word2Vec training, we use hierarchical softmax, which reduces the cost of computing $\log p(w_i | w_j)$ to $O(\log(V))$.
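
The following sketch illustrates why hierarchical softmax costs $O(\log(V))$. It makes simplifying assumptions that are not taken from H2O's code: the words sit at the leaves of a balanced binary tree (Word2Vec implementations typically use a Huffman tree instead), the inner-node vectors are made-up toy values, and the function names are hypothetical. The probability of a word is the product of one binary (sigmoid) decision per level of the tree.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_prob(word, v_in, node_vecs, depth):
    """p(word | input) as a product of `depth` binary decisions on the root-to-leaf path."""
    leaf = (1 << depth) + word              # heap-style index of the word's leaf
    prob = 1.0
    for level in range(depth, 0, -1):
        node = leaf >> level                # inner node visited at this step (root is 1)
        go_right = (leaf >> (level - 1)) & 1
        score = sum(a * b for a, b in zip(node_vecs[node], v_in))
        p_right = sigmoid(score)
        prob *= p_right if go_right else 1.0 - p_right
    return prob

depth = 3                                    # V = 2**3 = 8 words, so 3 decisions per word
vocab_size = 1 << depth
# Hypothetical toy vectors for the inner nodes (index 0 is unused).
node_vecs = [[0.1 * n, -0.05 * n] for n in range(vocab_size)]
v_in = [0.5, 1.0]                            # vector of the input word

probs = [hs_prob(w, v_in, node_vecs, depth) for w in range(vocab_size)]
print(probs, sum(probs))
```

Each probability needs only `depth` dot products rather than one per vocabulary word, and the leaf probabilities still sum to 1 because the two branches at every inner node do.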

Tverberg Release (H2O 3.10.3.4)

Below is a detailed list of all the items that are part of the Tverberg release.

List of New Features:

PUBDEV-2058 - Implement word2vec in h2o (To use this feature in R, please visit this demo)

PUBDEV-3635 - Ability to Select Columns for PDP computation in Flow (With this enhancement, users can select in Flow which features/columns to render Partial Dependence Plots for; R and Python already support this. Known issue PUBDEV-3782: when nbins < categorical levels, PDP won't compute. Please also visit this post.)

PUBDEV-3881 - Add PCA Estimator documentation to Python API Docs

PUBDEV-3902 - Documentation: Add information about Azure support to H2O User Guide (Beta)

PUBDEV-3739 - StackedEnsemble: put ensemble creation into the back end.

List of Improvements:

PUBDEV-3989 - Decrease size of h2o.jar

PUBDEV-3257 - Documentation: As a K-Means user, I want to be able to better understand the parameters

PUBDEV-3741 - StackedEnsemble: add tests in R and Python to ensure that a StackedEnsemble performs at least as well as the base_models

PUBDEV-3857 - Clean up the generated Python docs

PUBDEV-3895 - Filter H2OFrame on pandas dates and time (python)

PUBDEV-3912 - Provide way to specify context_path via Python/R h2o.init methods

PUBDEV-3933 - Modify gen_R.py for Stacked Ensemble

PUBDEV-3972 - Add Stacked Ensemble code examples to Python docstrings

List of Bugs:

PUBDEV-2464 - Using asfactor() in Python client cannot allocate to a variable

PUBDEV-3111 - R API's h2o.interaction() does not use destination_frame argument

PUBDEV-3694 - Errors with PCA on wide data for pca_method = GramSVD which is the default

PUBDEV-3742 - StackedEnsemble should work for regression

PUBDEV-3865 - h2o gbm : for an unseen categorical level, discrepancy in predictions when score using h2o vs pojo/mojo

PUBDEV-3883 - Negative indexing for H2OFrame is buggy in R API

PUBDEV-3894 - Relational operators don't work properly with time columns.

PUBDEV-3966 - java.lang.AssertionError when using h2o.makeGLMModel

PUBDEV-3835 - Standard Errors in GLM: calculating and showing specifically when called

PUBDEV-3965 - Importing data in python returns error - TypeError: expected string or bytes-like object

Hotfix: Remove StackedEnsemble from Flow UI. Training is only supported from Python and R interfaces. Viewing is supported in the Flow UI.

List of Tasks:

PUBDEV-3336 - h2o.create_frame(): if randomize=True, value param cannot be used

PUBDEV-3740 - REST: implement simple ensemble generation API

PUBDEV-3843 - Modify R REST API to always return binary data

PUBDEV-3844 - Safe GET calls for POJO/MOJO/genmodel

PUBDEV-3864 - Import files by pattern

PUBDEV-3884 - StackedEnsemble: Add to online documentation

PUBDEV-3940 - Add Stacked Ensemble code examples to R docs

Download here: http://h2o-release.s3.amazonaws.com/h2o/rel-tverberg/4/index.html
