Optimizing LightGBM for an imbalanced binary classification problem


1. Background

The data come from the Kaggle competition "Santander Customer Satisfaction". This is an imbalanced binary classification problem, and the goal is to maximize AUC (the area under the ROC curve). The competition has since ended.


Competition link:

https://www.kaggle.com/c/santander-customer-satisfaction


2. Modeling approach

This write-up uses Microsoft's open-source LightGBM for classification; it runs extremely fast. The steps are:

Read the data;

Parallel execution: the lightgbm package parallelizes through its own parameters, so the doParallel and foreach packages are not needed;

Feature selection: the mlr package is used to keep the features covering 99% of the chi.squared filter values;

Parameter tuning: the parameters of lgb.cv are adjusted step by step, over repeated passes, until the result is satisfactory;

Prediction: a LightGBM model is built with the tuned parameter values and predictions are written out (see the sketch at the end of this post). The AUC of the output reaches 0.833386, above the first-place Private Leaderboard score (0.829072).


3. The LightGBM algorithm

The LightGBM project does not spell out the algorithm's mathematical formulation, so it is not reproduced here; if needed, see the GitHub project page.

Project page:

https://github.com/Microsoft/LightGBM


Reading the data

options(java.parameters = "-Xmx8g") ## needed for the Java-based feature-selection filters; must be set before loading any packages

library(readr)
lgb_tr1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/train.csv")
lgb_te1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/test.csv")


Data exploration

1. Set up parallel execution

library(dplyr)
library(mlr)
library(parallelMap)
parallelStartSocket(2)


2. A first look at the columns

summarizeColumns(lgb_tr1) %>% View()


3. Handle missing values

## impute missing values by the column mean (integer and numeric columns)
imp_tr1 <- impute(
  as.data.frame(lgb_tr1),
  classes = list(
    integer = imputeMean(),
    numeric = imputeMean()
  )
)
imp_te1 <- impute(
  as.data.frame(lgb_te1),
  classes = list(
    integer = imputeMean(),
    numeric = imputeMean()
  )
)

After imputation:

summarizeColumns(imp_tr1$data) %>% View()


4. Check the class proportions of the training data: the classes are imbalanced

table(lgb_tr1$TARGET)
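The resulting counts give the imbalance ratio that motivates the weight search range used below (a small sketch of mine, not part of the original post):

## majority (TARGET = 0) to minority (TARGET = 1) ratio: about 24.3 for this
## data set, which is why the class weight is later searched over [1, 30]
cls <- table(lgb_tr1$TARGET)
cls[[1]] / cls[[2]]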


5. Drop constant columns

lgb_tr2 <- removeConstantFeatures(imp_tr1$data)
lgb_te2 <- removeConstantFeatures(imp_te1$data)


6. Keep only the columns shared by the training and test sets

tr2_name <- data.frame(tr2_name = colnames(lgb_tr2))
te2_name <- data.frame(te2_name = colnames(lgb_te2))
tr2_name_inner <- tr2_name %>%
  inner_join(te2_name, by = c('tr2_name' = 'te2_name'))
TARGET <- data.frame(TARGET = lgb_tr2$TARGET)
lgb_tr2 <- lgb_tr2[, c(tr2_name_inner$tr2_name[2:dim(tr2_name_inner)[1]])]
lgb_te2 <- lgb_te2[, c(tr2_name_inner$tr2_name[2:dim(tr2_name_inner)[1]])]
lgb_tr2 <- cbind(lgb_tr2, TARGET)
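As an aside (my sketch, not the author's code), the same alignment can be written more compactly in base R; this assumes, as the [2:n] indexing above suggests, that the first shared column is the ID:

## equivalent column alignment via intersect(), dropping the assumed ID column
common_cols <- setdiff(intersect(colnames(lgb_tr2), colnames(lgb_te2)), 'ID')
lgb_te2 <- lgb_te2[, common_cols]
lgb_tr2 <- cbind(lgb_tr2[, common_cols], TARGET = TARGET$TARGET)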


Notes:

1) Because LightGBM (a tree-based method) is used, the data are not standardized.

2) LightGBM is extremely efficient: on data under 1 GB it runs very fast even without feature selection. Features are still filtered here to speed things up further.

3) Features are filtered directly and no derived variables are created: the features are anonymized, and without knowing what they mean, generating derived variables at random is unpromising.


Feature selection: chi-square test

library(lightgbm)

1. A first pass over the class-weight value; it is refined further below

grid_search <- expand.grid(
  weight = seq(1, 30, 2)
  ## table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2] = 24.27261,
  ## hence weight is searched within [1, 30]
)


lgb_rate_1 <- numeric(length = nrow(grid_search))
set.seed(0)

for (i in 1:nrow(grid_search)) {
  w <- grid_search[i, 'weight']  # weight value of this grid row
  lgb_weight <- (lgb_tr2$TARGET * w + 1) / sum(lgb_tr2$TARGET * w + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr2[, 1:300]),
    label = lgb_tr2$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc'
  )
  # cross-validation
  lgb_tr2_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  lgb_rate_1[i] <- unlist(lgb_tr2_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr2_mod$record_evals$valid$auc$eval))]
}
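The final cross-validated AUC is dug out of record_evals with a rather long expression that recurs in every tuning loop below; a small helper function (my addition, not in the original post) would tidy this up:

## helper: return the last recorded validation AUC of an lgb.cv run
last_cv_auc <- function(cv_mod) {
  auc_path <- unlist(cv_mod$record_evals$valid$auc$eval)
  auc_path[length(auc_path)]
}
## usage: lgb_rate_1[i] <- last_cv_auc(lgb_tr2_mod)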

library(ggplot2)
grid_search$perf <- lgb_rate_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point()

The plot shows that AUC is not very sensitive to the weight and peaks at weight = 5.
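To make the weighting scheme explicit (my note, not the author's): the expression (TARGET * w + 1) / sum(TARGET * w + 1) gives each positive row a weight proportional to w + 1 and each negative row a weight proportional to 1, then normalizes so the weights sum to one. A toy illustration:

## toy example of the row-weight scheme with w = 4
target <- c(0, 0, 0, 1)
w <- 4
(target * w + 1) / sum(target * w + 1)  # negatives: 1/8 each; the positive: 5/8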


3. Feature selection

1) Compute the filter values

lgb_tr2$TARGET <- factor(lgb_tr2$TARGET)
lgb.task <- makeClassifTask(data = lgb_tr2, target = 'TARGET')
lgb.task.smote <- oversample(lgb.task, rate = 5)  # despite the name, mlr::oversample() is plain random oversampling, not SMOTE
fv_time <- system.time(
  fv <- generateFilterValuesData(
    lgb.task.smote,
    method = c('chi.squared')
    ## information gain or chi-square both work here; random-forest importance
    ## is not recommended (far too slow)
    ## the IV (information value) filter is also worth trying
    ## feature engineering caps the achievable target metric (AUC here); the
    ## filter method itself can be treated as a hyperparameter
  )
)


2) Plot the filter values

# plotFilterValues(fv)
plotFilterValuesGGVIS(fv)


3) Keep the features covering 99% of chi.squared (LightGBM is fast enough that a generous number of variables can be kept)

Note: the cutoff X in "top X% of chi.squared" can itself be treated as a hyperparameter.

fv_data2 <- fv$data %>%
  arrange(desc(chi.squared)) %>%
  mutate(chi_gain_cul = cumsum(chi.squared) / sum(chi.squared))

fv_data2_filter <- fv_data2 %>% filter(chi_gain_cul <= 0.99)
dim(fv_data2_filter) ## about half the predictors remain
fv_feature <- fv_data2_filter$name
lgb_tr3 <- lgb_tr2[, c(fv_feature, 'TARGET')]
lgb_te3 <- lgb_te2[, fv_feature]

4) Write out the data

write_csv(lgb_tr3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
write_csv(lgb_te3, 'C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')


Tuning the algorithm

lgb_tr <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv')
lgb_te <- rxImport('C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv')
## rxImport comes from Microsoft's RevoScaleR; readr::read_csv works equally well
## tip: to save memory, read lgb_te only when you are ready to predict

library(lightgbm)


1. Tuning the weight parameter

grid_search <- expand.grid(
  weight = 1:30
)

perf_weight_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)  # here i equals the weight value
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc'
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_weight_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)
grid_search$perf <- perf_weight_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at weight = 4 and declining from there.


2. Tuning the learning_rate parameter

grid_search <- expand.grid(
  learning_rate = 2 ^ (-(8:1))
)

perf_learning_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_learning_rate_1
ggplot(grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at learning_rate = 2^(-5), but the differences across 2^(-(6:3)) are tiny, so learning_rate = .125 is used to speed up training.


3. Tuning the num_leaves parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = seq(50, 800, 50)
)

perf_num_leaves_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_num_leaves_1
ggplot(grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at num_leaves = 650.


4. Tuning the min_data_in_leaf parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  min_data_in_leaf = 2 ^ (1:7)
)

perf_min_data_in_leaf_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    min_data_in_leaf = grid_search[i, 'min_data_in_leaf']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_data_in_leaf_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_data_in_leaf_1
ggplot(grid_search, aes(x = min_data_in_leaf, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC is insensitive to min_data_in_leaf, so it is left unadjusted.


5. Tuning the max_bin parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 2 ^ (5:10)
)

perf_max_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_bin_1
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at max_bin = 2^10, so max_bin is fine-tuned next.


6. Fine-tuning the max_bin parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 100 * (6:15)
)

perf_max_bin_2 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_bin_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_bin_2
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at max_bin = 1000.


7. Tuning the min_data_in_bin parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 2 ^ (1:9)
)

perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_data_in_bin_1
ggplot(grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at min_data_in_bin = 8, but the differences are minimal, so no further adjustment is made.


8. Tuning the feature_fraction parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = seq(.5, 1, .02)
)

perf_feature_fraction_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_feature_fraction_1
ggplot(grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at feature_fraction = .62 and staying stable over [.60, .62]; from .64 onward it trends downward.

9. Tuning the min_sum_hessian parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = seq(0, .02, .001)
)

perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_sum_hessian_1
ggplot(grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at min_sum_hessian = 0.005; values in [0.002, 0.005] look good, with a downward trend beyond 0.005.

10. Tuning the lambda parameters

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = seq(0, .01, .002),
  lambda_l2 = seq(0, .01, .002)
)

perf_lamda_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_lamda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_lamda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)

The plot suggests lambda_l1 = 0 and lambda_l2 = 0.


11. Tuning the drop_rate parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = seq(0, 1, .1)
)

perf_drop_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_drop_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at drop_rate = 0.2; 0, .2 and .5 all look good, and the variation across [0, 1] is small.
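A caveat of mine, not in the original post: drop_rate and max_drop are DART-only parameters, and the params lists here never set a boosting type, so LightGBM's default gbdt applies and these two parameters have no effect, which would explain the nearly flat curves. A sketch of how they would actually be exercised:

## sketch: drop parameters only apply under the DART boosting mode
params <- list(
  objective = 'binary',
  metric = 'auc',
  boosting = 'dart',  # required for drop_rate / max_drop to take effect
  drop_rate = .2,
  max_drop = 5
)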

12. Tuning the max_drop parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = seq(1, 10, 2)
)

perf_max_drop_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_drop_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaking at max_drop = 5, with little variation over [1, 10].

Second round of tuning

1. Tuning the weight parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_weight_2 <- numeric(length = 20)

for (i in 1:20) {
  lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)  # i is the weight value
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters (the grid has a single row, hence the index 1)
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[1, 'learning_rate'],
    num_leaves = grid_search[1, 'num_leaves'],
    max_bin = grid_search[1, 'max_bin'],
    min_data_in_bin = grid_search[1, 'min_data_in_bin'],
    feature_fraction = grid_search[1, 'feature_fraction'],
    min_sum_hessian = grid_search[1, 'min_sum_hessian'],
    lambda_l1 = grid_search[1, 'lambda_l1'],
    lambda_l2 = grid_search[1, 'lambda_l2'],
    drop_rate = grid_search[1, 'drop_rate'],
    max_drop = grid_search[1, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_weight_2[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

library(ggplot2)
ggplot(data.frame(weight = 1:20, perf = perf_weight_2), aes(x = weight, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC stabilizing once weight >= 3, with the maximum at weight = 7.


2. Tuning the learning_rate parameter

grid_search <- expand.grid(
  learning_rate = seq(.05, .5, .03),
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_learning_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_learning_rate_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_learning_rate_1
ggplot(data = grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC peaks at learning_rate = .11.


3. Tuning the num_leaves parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = seq(100, 800, 50),
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_num_leaves_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_num_leaves_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_num_leaves_1
ggplot(data = grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC peaks at num_leaves = 200.


4. Tuning the max_bin parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = seq(100, 1500, 100),
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_max_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_max_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_max_bin_1
ggplot(data = grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC peaks at max_bin = 600; 400 and 800 are also acceptable values.


5. Tuning the min_data_in_bin parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = seq(5, 50, 5),
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_data_in_bin_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_data_in_bin_1
ggplot(data = grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC peaks at min_data_in_bin = 45; 25 is also an acceptable value.


6. Tuning the feature_fraction parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = seq(.5, .9, .02),
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_feature_fraction_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_feature_fraction_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_feature_fraction_1
ggplot(data = grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC peaks at feature_fraction = .54; .56 and .58 also perform well.


7. Tuning the min_sum_hessian parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = seq(.001, .008, .0005),
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_min_sum_hessian_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_min_sum_hessian_1
ggplot(data = grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC peaks at min_sum_hessian = 0.0065; 0.003 and 0.0055 are also acceptable values.


8. Tuning the lambda parameters

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = 0.0065,
  lambda_l1 = seq(0, .001, .0002),
  lambda_l2 = seq(0, .001, .0002),
  drop_rate = .2,
  max_drop = 5
)

perf_lambda_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = 'binary',
    metric = 'auc',
    learning_rate = grid_search[i, 'learning_rate'],
    num_leaves = grid_search[i, 'num_leaves'],
    max_bin = grid_search[i, 'max_bin'],
    min_data_in_bin = grid_search[i, 'min_data_in_bin'],
    feature_fraction = grid_search[i, 'feature_fraction'],
    min_sum_hessian = grid_search[i, 'min_sum_hessian'],
    lambda_l1 = grid_search[i, 'lambda_l1'],
    lambda_l2 = grid_search[i, 'lambda_l2'],
    drop_rate = grid_search[i, 'drop_rate'],
    max_drop = grid_search[i, 'max_drop']
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  perf_lambda_1[i] <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)[length(unlist(lgb_tr_mod$record_evals$valid$auc$eval))]
}

grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2)
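Final model and prediction

The post ends with the lambda grid above. To close the loop on the outline's last step (build a model with the tuned values and write out predictions), here is a minimal sketch; it is my reconstruction rather than the author's code, using the second-round parameter picks quoted above, while nrounds = 300 and the use of the raw test file's ID column are assumptions to adjust against your own CV output.

## minimal final fit + prediction (assumed values; not from the original post)
lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
lgb_train <- lgb.Dataset(
  data = data.matrix(lgb_tr[, 1:148]),
  label = lgb_tr$TARGET,
  free_raw_data = FALSE,
  weight = lgb_weight
)
params <- list(
  objective = 'binary',
  metric = 'auc',
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  num_threads = 2
)
lgb_mod <- lgb.train(params, data = lgb_train, nrounds = 300)
pred <- predict(lgb_mod, data.matrix(lgb_te))
## submission file, assuming the ID column of the raw test data (lgb_te1)
write_csv(data.frame(ID = lgb_te1$ID, TARGET = pred),
          'C:/Users/Administrator/Documents/kaggle/scs_lgb/submission.csv')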