
GBM: Basic Theory and Derivation


With some spare time on my hands (who am I kidding, what about those four deadlines…), let's work through how GBM (Gradient Boosting Method) is implemented under squared loss and log loss. First, some links: wiki-gbm is the Wikipedia article, which gives a fairly intuitive explanation along with the theory, and Xiao's blog provides an rpart-based implementation (which, as usual, requires a VPN to access).

GBM is a boosting method that iteratively improves a model. Yes, a model! In principle, any model with an explicit expression can be boosted with GBM. The most familiar instance is the gbm package in R, and what is actually used most often is GBDT (Gradient Boosting Decision Tree), i.e. boosting decision trees with the GBM machinery.


The GBM algorithm


The GBM algorithm proceeds as follows:



Initialize the model with a constant (F_0(x)):




(F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma)).





For (m = 1) to (M):






Compute the "pseudo-residuals":






(r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} \quad \mbox{for } i=1,\ldots,n).






Fit a new base model (h_m(x)) to the pseudo-residuals, i.e. train it using the training set ({(x_i, r_{im})}_{i=1}^n).






Determine the multiplier (\gamma_m) by solving the following optimization problem:






(\gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right)).






Update the model:






(F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)).





Output the final model (F_M(x)).



A few brief remarks on the algorithm:



  • The final estimate is (F_M(x)), which can be viewed as an ensemble of weak learners

    • (F_M(x) = \sum_{m=0}^M \gamma_m h_m(x))

    • (F_0(x)) can be viewed as the constant term of the model


  • The overall idea resembles the Newton-Raphson method: the loss function is reduced step by step along the gradient until an optimum is reached

  • Concretely, the loss (L(y,F)) is driven down by repeatedly fitting a new learner to the pseudo-residuals

  • The line-search step for (\gamma_m) is often skipped in practice and replaced by a fixed shrinkage parameter (a loss-agnostic sketch of the whole loop is given right below)
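
To make the loop concrete, here is a minimal, loss-agnostic sketch in R. It is illustrative only: gbm_sketch, neg_gradient and f0 are hypothetical names, neg_gradient is a helper returning the pseudo-residuals for whichever loss is used, f0 is the constant initial value, rpart stumps play the role of (h_m(x)), and the data frame is assumed to have columns x1, x2 and y.


library(rpart)
gbm_sketch = function(data, neg_gradient, f0, shrinkage = 0.1, iter = 100){
  f = rep(f0, nrow(data))                      ## F_0(x): constant initial model
  learners = vector("list", iter)
  for(m in 1:iter){
    data$dy = neg_gradient(data$y, f)          ## pseudo-residuals r_im
    learners[[m]] = rpart(dy ~ x1 + x2, data = data,
                          control = rpart.control(maxdepth = 1, cp = 0))
    f = f + shrinkage * predict(learners[[m]]) ## F_m = F_{m-1} + shrinkage * h_m
  }
  list(learner = learners, f = f)
}

## e.g. squared loss: r_im = y - F_{m-1}(x)
## fit = gbm_sketch(data, neg_gradient = function(y, f) y - f, f0 = mean(data$y))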


Next, I will walk through the derivation of this procedure under squared loss (square loss) and the log loss function (log loss).




Model derivation



Squared loss


Squared loss is also known as Gaussian loss; the loss function is defined as [L(y_i,f(x_i)) = (y_i-f(x_i))^2/2] and the summed loss sometimes carries observation weights: [\frac{1}{2\sum \omega_i} \sum\omega_i(y_i-f(x_i))^2]


Here (\omega_i) are the weights of the Gaussian loss; if no weights are specified they all default to 1. If (f(x)) is not centered, it is replaced by (f(x) + o_i), where (o_i) is an offset.



Computing the initial value


Following the algorithm above, the initial value is [F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) = \underset{\gamma}{\arg\min} \frac{1}{2\sum \omega_i}\sum \omega_i(y_i-f_0(x_i))^2]


Differentiate the loss with respect to (f_0) and set the derivative to zero:


[\frac{\partial\, \frac{1}{2\sum \omega_i}\sum \omega_i(y_i-f_0(x_i)-o_i)^2}{\partial f_0(x_i)} = -\frac{1}{\sum \omega_i}\sum \omega_i(y_i-f_0(x_i) - o_i) = 0]


[\sum \omega_i(y_i-f_0(x_i) - o_i) = 0] [f_0(x) = \frac{\sum \omega_i(y_i-o_i)}{\sum \omega_i}]


In particular, if the model is centered ((o_i = 0)) and no weights are set ((\omega_i = 1)), then


[f_0(x) = \frac{\sum y_i}{n} = \overline{y}]
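
As a quick sanity check, the constant minimizing the summed squared loss can also be found numerically; a small sketch with a hypothetical vector y:


## Sanity check (hypothetical data): the constant minimizing the squared
## loss is the sample mean, as derived above.
y = c(0.2, 1.5, 0.7, 2.3, 1.1)
opt = optimize(function(g) sum((y - g)^2 / 2), interval = range(y))
c(optimized = opt$minimum, sample_mean = mean(y))  ## the two agree up to tolerance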




Computing the pseudo-residuals


For this part I iterate directly with the default, unweighted case:


[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}] [ = -\left[\frac{\partial \left((y_i-f(x_i))^2/2\right)}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}] [ = y_i - f_{m-1}(x_i)]




The remaining steps



  • Fit a new model (e.g. an rpart tree) to the pseudo-residuals (r_{im}) using all the data

  • Update the model

    • The line search is dropped; if it were kept, it would amount to a Newton-Raphson style approximation

    • (F_m(x) = F_{m-1}(x) + \gamma_m h_m(x))

    • A shrinkage parameter is used in place of (\gamma)





Code implementation


Following Xiao's lead, here is the code written out by hand; the base learners are trained with the rpart package.



  • Model training function


library(rpart)
gbm_gaussian = function(data, shrinkage = 0.01, iter = 100, ...){
  learner = list()
  ## Initial value F_0: mean of the 0/1 response
  f = mean(data$y == 1)
  ## pseudo-residuals
  data$dy = data$y - f
  for(i in 1:iter){
    learner[[i]] = rpart(dy ~ x1 + x2, data = data,
                         control = rpart.control(maxdepth = 1, cp = 0))
    ## Improve the model: new predicted value for each observation
    f = f + shrinkage * predict(learner[[i]])
    ## update pseudo-residuals
    data$dy = data$y - f
  }
  result = list('learner' = learner,
                'shrinkage' = shrinkage,
                'iter' = iter)
  return(result)
}


  • A prediction function for the model above:


predict_gbm_gaussian = function(model, newdata){
  predict_y = list()
  f = mean(newdata$y == 1)
  n = nrow(newdata)
  iter = model$iter
  shrinkage = model$shrinkage
  learner = model$learner
  predict_y[[1]] = f + shrinkage * predict(learner[[1]], newdata = newdata)

  for(i in 2:iter){
    predict_y[[i]] = predict_y[[i-1]] + shrinkage * predict(learner[[i]], newdata = newdata)
  }
  mse = sapply(predict_y, function(pre_y) sum((pre_y - newdata$y)^2/2))
  result = list('predict_y' = predict_y,
                'mse' = mse)
  return(result)
}


  • Try it on the data:

    • Train the model



data = iris[1:100,1:4]
data$y = rep(0:1,each = 50)
names(data)[1:2] = c("x1","x2")
head(data)

##    x1  x2 Petal.Length Petal.Width y
## 1 5.1 3.5          1.4         0.2 0
## 2 4.9 3.0          1.4         0.2 0
## 3 4.7 3.2          1.3         0.2 0
## 4 4.6 3.1          1.5         0.2 0
## 5 5.0 3.6          1.4         0.2 0
## 6 5.4 3.9          1.7         0.4 0

model = gbm_gaussian(data,0.001,1000)
objects(model)

## [1] "iter"      "learner"   "shrinkage"

final = predict_gbm_gaussian(model,data)$predict_y[[1000]]
y_pred = ifelse(final>0.5,1,-1)
table(data$y, y_pred)

##    y_pred
##     -1  1
##   0 45  5
##   1  6 44

Now try the gbm package:


library(gbm)

## Loading required package: survival
## Loading required package: splines
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1

model = gbm(y ~ x1 + x2, distribution = "gaussian",
            n.minobsinnode = 1,
            bag.fraction = 1,
            data = data, n.trees = 1000,
            shrinkage = 0.001)
final = predict(model,data,1000)
y_pred = ifelse(final>0.5,1,-1)
table(data$y, y_pred)

##    y_pred
##     -1  1
##   0 45  5
##   1  6 44




Derivation under log loss


The log loss is built from the log-likelihood [log(p(y|x,w))], but we usually see it written in the following form (with (y \in \{-1,1\})):


[L(y,\eta) = -log(p(y|x,w)) = log(1+e^{-y \eta})]


The reason is as follows: a classifier can be understood as modelling the log-odds of the response,


[f(x_i) = log\frac{p(y=1|x_i,w)}{p(y=-1|x_i,w)} = \eta_i]


So, writing (P = p(y=1|x,w)), we have [\eta = log \frac{P}{1-P}, \qquad p(y=1|x,w) = P = \frac{e^{\eta}}{1+e^{\eta}}]


and for (p(y=-1|x,w) = 1-P): [p(y=-1|x,w) = 1- P = 1- \frac{e^{\eta}}{1+e^{\eta}} = \frac{e^{-\eta}}{1+e^{-\eta}}]


Putting the two cases together: [p(y|x,w) = \frac{e^{y\eta}}{1+e^{y\eta}}]


So the log loss can be written as [L(y,\eta) = -log(p(y|x,w)) = -log\frac{e^{y\eta}}{1+e^{y\eta}} = log\frac{1+e^{y\eta}}{e^{y\eta}} = log(1+e^{-y \eta})]
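
This identity is easy to verify numerically in R with plogis() (the logistic function (e^x/(1+e^x))); the grid of (\eta) values below is just an arbitrary illustration:


## Check: log(1 + exp(-y*eta)) equals -log p(y|x,w) with p given by the
## logistic function, for both labels y = 1 and y = -1.
eta = seq(-3, 3, by = 0.5)
for(y in c(1, -1)){
  stopifnot(isTRUE(all.equal(log(1 + exp(-y * eta)), -log(plogis(y * eta)))))
}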



Computing the initial value


Following the algorithm above, the initial value is [F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) = \underset{\gamma}{\arg\min} \sum_{i=1}^n log(1+e^{-y_i f_0(x)})]


Differentiate the loss and set the derivative to zero:


[\frac{\partial \sum_{i=1}^n log(1+e^{-y_i f_0(x)})}{\partial f_0(x)} = \sum \frac{-y_i\, exp(-y_i f_0(x))}{1+exp(-y_i f_0(x))} \\
= \sum_{y_i=1} \frac{-exp(-f_0(x))}{1+exp(-f_0(x))} + \sum_{y_i=-1} \frac{exp(f_0(x))}{1+exp(f_0(x))} \\
= \overline{y}\,n\,\frac{-1}{1+exp(f_0(x))} + (1-\overline{y})\,n\,\frac{exp(f_0(x))}{1+exp(f_0(x))} = 0]

where (\overline{y}) is the proportion of observations with (y_i = 1). Hence

[(1-\overline{y})\,exp(f_0(x)) - \overline{y} = 0] [exp(f_0(x)) = \frac{\overline{y}}{1-\overline{y}}] [f_0(x) = log\frac{\overline{y}}{1-\overline{y}}]
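
As with the squared-loss case, this can be checked numerically; a small sketch with a hypothetical label vector y coded in {-1, 1}:


## Check: the constant minimizing sum(log(1 + exp(-y*gamma))) is the
## log-odds of the positive class, log(ybar / (1 - ybar)).
y = c(1, 1, 1, -1, -1)                       ## 3 positives, 2 negatives
logloss = function(gamma) sum(log(1 + exp(-y * gamma)))
opt = optimize(logloss, interval = c(-5, 5))
ybar = mean(y == 1)
c(optimized = opt$minimum, log_odds = log(ybar / (1 - ybar)))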




Computing the pseudo-residuals


[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}] [ = -\left[\frac{\partial\, log(1+e^{-y_i f(x_i)})}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}] [ = -\frac{-y_i\, exp(-y_i f(x_i))}{1+exp(-y_i f(x_i))}] [ = \frac{y_i}{1+exp(y_i f(x_i))}]
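
The gradient formula can be spot-checked against a numeric derivative; the values of y and f below are arbitrary illustrative choices:


## Check: -d/df log(1 + exp(-y*f)) equals y / (1 + exp(y*f)).
y = 1; f = 0.3; eps = 1e-6
loss = function(f) log(1 + exp(-y * f))
numeric_grad = (loss(f + eps) - loss(f - eps)) / (2 * eps)
c(formula = y / (1 + exp(y * f)), numeric = -numeric_grad)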




Code implementation


I won't rewrite this part; below is Xiao's code, taken from the blog post linked at the top.


library(ggplot2)

## Warning: package ‘ggplot2’ was built under R version 3.1.3

# read data
data = subset(iris, Species != 'virginica', select = c(1,2,5))
data$y = ifelse(data$Species == 'setosa', 1, -1)
data$Species = NULL
names(data)[1:2] = c('x1','x2')
head(data)

##    x1  x2 y
## 1 5.1 3.5 1
## 2 4.9 3.0 1
## 3 4.7 3.2 1
## 4 4.6 3.1 1
## 5 5.0 3.6 1
## 6 5.4 3.9 1

p = ggplot(data,aes(x1,x2,color=factor(y)))+
geom_point()
print(p)


# train boosting tree for classification
GBC = function(data, rate = 0.1, iter = 100){
  library(rpart)
  max_depth = 1
  learner = list()
  mu = mean(data$y == 1)
  # start with an initial model
  # mu = p(y=1) -> f = log(mu/(1-mu))
  f = log(mu/(1-mu))
  data$dy = data$y/(1+exp(data$y*f)) # dy is the negative gradient of the log loss
  for (i in 1:iter) {
    # use a weak model to improve
    learner[[i]] = rpart(dy ~ x1 + x2, data = data,
                         control = rpart.control(maxdepth = max_depth, cp = 0))
    # improve model
    f = f + rate * predict(learner[[i]])
    # update dy
    data$dy = data$y/(1+exp(data$y*f))
  }
  result = list('learner' = learner,
                'rate' = rate,
                'iter' = iter)
  return(result)
}

model = GBC(data,rate=0.1,iter=1000)
objects(model)

## [1] "iter"    "learner" "rate"

# predict function
predict_GBC = function(data, model){
  predict_y = list()
  mu = mean(data$y == 1)
  f = log(mu/(1-mu))
  n = nrow(data)
  iter = model$iter
  rate = model$rate

  predict_y[[1]] = rep(f, n)
  learner = model$learner
  for (i in 2:iter) {
    predict_y[[i]] = predict_y[[i-1]] + rate * predict(learner[[i-1]], newdata = data)
  }
  mse = sapply(predict_y, function(pre_y) sum(log(1+exp(-data$y*pre_y)))) # log loss
  result = list('predict_y' = predict_y,
                'mse' = mse)
  return(result)
}

# predict data
predict_train = predict_GBC(data,model=model)
objects(predict_train)

## [1] "mse"       "predict_y"

plot(predict_train$mse, type = 'l')


final = predict_train$predict_y[[1000]]
y_p = 1/(1+exp(-final))
y_pred = ifelse(y_p>0.5,1,-1)
table(data$y, y_pred)

##      y_pred
##       -1  1
##   -1  50  0
##   1    0 50
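
For completeness, the gbm package can also fit this classification problem directly with its built-in Bernoulli loss. A sketch below, assuming the same data frame as above: note that gbm expects the response coded in {0, 1} rather than {-1, 1}, so the labels are recoded first, and the fitted values need not coincide exactly with the hand-rolled version since gbm uses its own internal parameterisation.


library(gbm)
# Comparison: gbm's Bernoulli boosting (response must be coded 0/1)
data01 = data
data01$y = ifelse(data$y == 1, 1, 0)
model_gbm = gbm(y ~ x1 + x2, distribution = "bernoulli",
                data = data01, n.trees = 1000, shrinkage = 0.1,
                n.minobsinnode = 1, bag.fraction = 1)
p_hat = predict(model_gbm, data01, n.trees = 1000, type = "response")
table(data01$y, ifelse(p_hat > 0.5, 1, -1))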


