Fundamentals of GBM: Theory and Derivation
With a bit of free time on my hands (never mind the four looming deadlines...), let's derive how GBM (Gradient Boosting Method) works under squared loss and log loss. First, some links: wiki-gbm is the Wikipedia article, which gives a fairly intuitive explanation along with the theory, and Teacher Xiao's blog provides an rpart-based implementation (as usual, it requires a proxy to access).
GBM is a boosting method that iteratively improves a model (yes, a model!). In principle, any model with an explicit expression can be boosted with GBM. The most familiar instance is the GBM package in R; in practice what is most commonly used is GBDT (Gradient Boosting Decision Tree), that is, boosting decision trees with GBM.
The GBM Algorithm
The GBM algorithm proceeds as follows:
1. Initialize the model with a constant $F_0(x)$:
$$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma)$$
2. For $m = 1$ to $M$:
- Compute the "pseudo-residuals":
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} \quad \mbox{for } i=1,\ldots,n$$
- Fit a new base learner $h_m(x)$ to the pseudo-residuals, i.e. train it on the set $\{(x_i, r_{im})\}_{i=1}^n$.
- Determine the step size $\gamma_m$ by solving the optimization problem
$$\gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right)$$
- Update the model:
$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$
3. Output the final model $F_M(x)$.
A few brief remarks on the algorithm:
- $F_M(x)$ can be viewed as an ensemble of weak learners: $F_M(x) = \sum_{m=0}^M \gamma_m h_m(x)$
- $F_0(x)$ plays the role of the model's constant term
- $L(y,F)$ is the loss function
- in practice the contribution of each new learner is scaled by a shrinkage parameter rather than the exact $\gamma_m$

Next, I will walk through this procedure under the squared loss (square loss) and the log loss (log-loss).
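Before deriving the specific losses, here is a minimal, loss-agnostic sketch of the loop above in R. The helper arguments f0 and negative_gradient are hypothetical names introduced only for illustration; the concrete versions for squared loss and log loss appear later in the post.
library(rpart)
# A generic boosting loop: the caller supplies the initial-value rule f0(y)
# and the negative gradient of the chosen loss (both hypothetical helpers).
gbm_generic = function(data, f0, negative_gradient, shrinkage = 0.01, iter = 100){
  learner = list()
  f = f0(data$y)                                    # step 1: constant model F_0
  for(i in 1:iter){
    data$dy = negative_gradient(data$y, f)          # step 2a: pseudo-residuals r_im
    learner[[i]] = rpart(dy ~ x1 + x2, data = data,
                         control = rpart.control(maxdepth = 1, cp = 0))
    f = f + shrinkage * predict(learner[[i]])       # steps 2b-2d: add the shrunken h_m
  }
  list(learner = learner, fitted = f, shrinkage = shrinkage, iter = iter)
}
# For squared loss: gbm_generic(data, f0 = mean, negative_gradient = function(y, f) y - f)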
Model Derivation
Squared Loss
The squared loss is also called the Gaussian loss, and is defined as: $$L(y_i,f(x_i)) = (y_i-f(x_i))^2/2$$
When summing the loss, weights are sometimes added: $$\frac{1}{2\sum \omega_i} \sum\omega_i(y_i-f(x_i))^2$$
where $\omega_i$ are the weights of the Gaussian loss (if no weights are set, they all default to 1), and if $f(x)$ is not centered it is replaced by $f(x) + o_i$, with $o_i$ an offset term.
Computing the Initial Value
Following the algorithm above, the initial value is
$$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) = \underset{\gamma}{\arg\min} \frac{1}{2\sum \omega_i}\sum \omega_i(y_i - o_i - f_0(x))^2$$
Differentiate the loss with respect to the constant $f_0(x)$ and set the derivative to zero:
$$\frac{\partial}{\partial f_0(x)} \frac{1}{2\sum \omega_i}\sum \omega_i(y_i - o_i - f_0(x))^2 = -\frac{1}{\sum \omega_i}\sum \omega_i(y_i - o_i - f_0(x)) = 0$$
$$\sum \omega_i(y_i - o_i - f_0(x)) = 0$$
$$f_0(x) = \frac{\sum \omega_i(y_i-o_i)}{\sum \omega_i}$$
In particular, if the model is centered ($o_i = 0$) and no weights are set ($\omega_i = 1$), then
$$f_0(x) = \frac{\sum y_i}{n} = \overline{y}$$
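As a quick sanity check of this closed form (a toy example with made-up weights and offsets, not from the original post), the weighted formula matches a direct numerical minimization with optimize():
set.seed(1)
y = rnorm(10); w = runif(10); o = rnorm(10)           # toy responses, weights, offsets
f0_closed = sum(w * (y - o)) / sum(w)                 # closed-form initial value
f0_numeric = optimize(function(g) sum(w * (y - o - g)^2) / (2 * sum(w)),
                      interval = c(-10, 10))$minimum  # direct minimization of the loss
c(f0_closed, f0_numeric)                              # the two agree up to tolerance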
Computing the Pseudo-Residuals
For this part I iterate directly with the default unweighted case:
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$
$$ = -\left[\frac{\partial\, (y_i-f(x_i))^2/2}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$
$$ = y_i - f_{m-1}(x_i)$$
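A one-off finite-difference check (toy numbers, purely illustrative) confirms that $y - f$ is the negative gradient of $(y-f)^2/2$:
y = 2; f = 0.5; eps = 1e-6
num_grad = ((y - (f + eps))^2 / 2 - (y - f)^2 / 2) / eps  # numerical dL/df
c(-num_grad, y - f)                                        # both are ~1.5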
The remaining steps:
- Fit a new model (e.g. an rpart tree) to the pseudo-residuals $r_{im}$ on all of the data
- Update the model:
$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$
- Skip the line search for $\gamma_m$ (if needed, it can be approximated by a Newton-Raphson step); instead, replace $\gamma$ with a fixed shrinkage parameter
Code Implementation
Following Teacher Xiao's lead, I hand-wrote this code; model training is done with the rpart package.
- The training function
library(rpart)
gbm_gaussian = function(data, shrinkage = 0.01, iter = 100, ...){
  learner = list()
  ## Initial value: f0 = mean(y) (here y is coded 0/1, so this equals the positive proportion)
  f = mean(data$y)
  ## pseudo-residuals
  data$dy = data$y - f
  for(i in 1:iter){
    learner[[i]] = rpart(dy ~ x1 + x2, data = data,
                         control = rpart.control(maxdepth = 1, cp = 0))
    ## Improve the model: new predicted value for each observation
    f = f + shrinkage * predict(learner[[i]])
    ## update pseudo-residuals
    data$dy = data$y - f
  }
  result = list('learner' = learner,
                'shrinkage' = shrinkage,
                'iter' = iter)
  return(result)
}
- A prediction function for the model above:
predict_gbm_gaussian = function(model, newdata){
  predict_y = list()
  ## same initial value as in training: the mean of y
  f = mean(newdata$y)
  n = nrow(newdata)
  iter = model$iter
  shrinkage = model$shrinkage
  learner = model$learner
  predict_y[[1]] = f + shrinkage * predict(learner[[1]], newdata = newdata)
  for(i in 2:iter){
    predict_y[[i]] = predict_y[[i-1]] + shrinkage * predict(learner[[i]], newdata = newdata)
  }
  mse = sapply(predict_y, function(pre_y) sum((pre_y - newdata$y)^2/2))
  result = list('predict_y' = predict_y,
                'mse' = mse)
  return(result)
}
- Try it on some data:
- Train the model:
data = iris[1:100,1:4]
data$y = rep(0:1,each = 50)
names(data)[1:2] = c("x1","x2")
head(data)
## x1 x2 Petal.Length Petal.Width y
## 1 5.1 3.5 1.4 0.2 0
## 2 4.9 3.0 1.4 0.2 0
## 3 4.7 3.2 1.3 0.2 0
## 4 4.6 3.1 1.5 0.2 0
## 5 5.0 3.6 1.4 0.2 0
## 6 5.4 3.9 1.7 0.4 0
model = gbm_gaussian(data,0.001,1000)
objects(model)
## [1] "iter" "learner" "shrinkage"
final = predict_gbm_gaussian(model,data)$predict_y[[1000]]
y_pred = ifelse(final>0.5,1,-1)
table(data$y, y_pred)
## y_pred
## -1 1
## 0 45 5
## 1 6 44
Now try the gbm package:
library(gbm)
## Loading required package: survival
## Loading required package: splines
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1
model = gbm(y~x1+x2,distribution = "gaussian",
n.minobsinnode = 1,
bag.fraction =1,
data = data,n.trees = 1000,
shrinkage=0.001
)
final = predict(model,data,1000)
y_pred = ifelse(final>0.5,1,-1)
table(data$y, y_pred)
## y_pred
## -1 1
## 0 45 5
## 1 6 44
Derivation under Log Loss
The log loss: the negative log-likelihood has the form $-log(p(y|x,w))$, but we often see the log loss written in the following form (with $y \in \{-1,1\}$):
$$L(y,\eta) = -log(p(y|x,w)) = log(1+e^{-y \eta})$$
The reason is as follows. A classifier can be understood as computing the log-odds of the response:
$$f(x_i) = log\frac{p(y=1|x_i,w)}{p(y=-1|x_i,w)} = \eta_i$$
So with $P = p(y=1|x,w)$:
$$\eta = log \frac{P}{1-P}$$
$$p(y=1|x,w) = P = \frac{e^{\eta}}{1+e^{\eta}}$$
and for the other class, $p(y=-1|x,w) = 1-P$:
$$p(y=-1|x,w) = 1- P = 1- \frac{e^{\eta}}{1+e^{\eta}} = \frac{e^{-\eta}}{1+e^{-\eta}}$$
Both cases can be written together as: $$p(y|x,w) = \frac{e^{y\eta}}{1+e^{y\eta}}$$
So the log loss function becomes:
$$L(y,\eta) = -log(p(y|x,w)) = -log\frac{e^{y\eta}}{1+e^{y\eta}} = log\frac{1+e^{y\eta}}{e^{y\eta}} = log(1+e^{-y \eta})$$
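A quick numerical check (toy values of $\eta$, not from the original post) that the two ways of writing the loss agree for $y \in \{-1,1\}$:
eta = seq(-3, 3, by = 0.5)
p_pos = exp(eta) / (1 + exp(eta))                   # p(y = 1 | x, w)
all.equal(-log(p_pos),     log(1 + exp(-1 * eta)))  # case y =  1
all.equal(-log(1 - p_pos), log(1 + exp( 1 * eta)))  # case y = -1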
Computing the Initial Value
Following the algorithm above, the initial value is
$$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) = \underset{\gamma}{\arg\min} \sum_{i=1}^n log(1+e^{-y_i f_0(x)})$$
Differentiate the loss with respect to $f_0(x)$ and set the derivative to zero (below, $\overline{y}$ denotes the proportion of observations with $y = 1$):
$$\frac{\partial \sum_{i=1}^n log(1+e^{-y_i f_0(x)})}{\partial f_0(x)} = \sum_i\frac{-y_i\, e^{-y_i f_0(x)}}{1+e^{-y_i f_0(x)}}$$
$$= \sum_{y_i=1} \frac{-e^{-f_0(x)}}{1+e^{-f_0(x)}} + \sum_{y_i=-1} \frac{e^{f_0(x)}}{1+e^{f_0(x)}}$$
$$= \overline{y}\,n\,\frac{-1}{1+e^{f_0(x)}} + (1-\overline{y})\,n\,\frac{e^{f_0(x)}}{1+e^{f_0(x)}} = 0 $$
$$(1-\overline{y})\,e^{f_0(x)} - \overline{y} = 0$$
$$e^{f_0(x)} = \frac{\overline{y}}{1-\overline{y}}$$
$$f_0(x) = log\frac{\overline{y}}{1-\overline{y}}$$
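The same kind of numerical check applies here (toy labels in $\{-1,1\}$, chosen only for illustration): the log-odds of the positive proportion minimizes the summed log loss.
y = c(rep(1, 30), rep(-1, 70))
mu = mean(y == 1)                                     # proportion of y = 1
f0_closed = log(mu / (1 - mu))                        # closed-form initial value
f0_numeric = optimize(function(g) sum(log(1 + exp(-y * g))),
                      interval = c(-10, 10))$minimum
c(f0_closed, f0_numeric)                              # both near log(0.3/0.7)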
Computing the Pseudo-Residuals
$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$
$$ = -\left[\frac{\partial (log(1+e^{-yf(x_i)}))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$
$$ = -\frac{-yexp(-yf(x))}{1+exp(-yf(x))}$$
$$ = \frac{y}{1+exp(yf(x))}$$
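And the matching finite-difference check for the log-loss pseudo-residual (toy values, purely illustrative):
y = -1; f = 0.3; eps = 1e-6
num_grad = (log(1 + exp(-y * (f + eps))) - log(1 + exp(-y * f))) / eps  # numerical dL/df
c(-num_grad, y / (1 + exp(y * f)))                                       # both are ~ -0.574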
Code Implementation
I won't rewrite this part; instead, here is Teacher Xiao's code (linked here).
library(ggplot2)
## Warning: package ‘ggplot2’ was built under R version 3.1.3
# read data
data = subset(iris, Species != 'virginica', select = c(1,2,5))
data$y = ifelse(data$Species == 'setosa', 1, -1)
data$Species = NULL
names(data)[1:2] = c('x1','x2')
head(data)
## x1 x2 y
## 1 5.1 3.5 1
## 2 4.9 3.0 1
## 3 4.7 3.2 1
## 4 4.6 3.1 1
## 5 5.0 3.6 1
## 6 5.4 3.9 1
p = ggplot(data,aes(x1,x2,color=factor(y)))+
geom_point()
print(p)
# train boosting tree for classification
GBC = function(data,rate=0.1,iter=100){
library(rpart)
max_depth=1
learner = list()
mu = mean(data$y==1)
# start with an initial model
# mu=p(y=1) -> f=wx = log(mu/(1-mu))
f = log(mu/(1-mu))
data$dy = data$y/(1+exp(data$y*f)) # dy is the negative gradient of the log loss function
for (i in 1:iter) {
# use a weak model to improve
learner[[i]] = rpart(dy~x1+x2,data=data,
control=rpart.control(maxdepth=max_depth,cp=0))
# improve model
f = f + rate * predict(learner[[i]])
# modify dy
data$dy = data$y/(1+exp(data$y*f))
}
result = list('learner'=learner,
'rate'=rate,
'iter'=iter)
return(result)
}
model = GBC(data,rate=0.1,iter=1000)
objects(model)
## [1] "iter" "learner" "rate"
# predict function
predict_GBC = function(data,model){
predict_y = list()
mu = mean(data$y==1)
f = log(mu/(1-mu))
n = nrow(data)
iter = model$iter
rate = model$rate
predict_y[[1]] = rep(f,n)
learner = model$learner
for (i in 2:iter) {
predict_y[[i]] = predict_y[[i-1]] + rate * predict(learner[[i-1]],newdata=data)
}
mse = sapply(predict_y,function(pre_y) sum(log(1+exp(-data$y*pre_y)))) # logistic loss function
result = list('predict_y'=predict_y,
'mse'= mse)
return(result)
}
# predict data
predict_train = predict_GBC(data,model=model)
objects(predict_train)
## [1] "mse" "predict_y"
plot(predict_train$mse,type='l')
final = predict_train$predict_y[[1000]]
y_p = 1/(1+exp(-final))
y_pred = ifelse(y_p>0.5,1,-1)
table(data$y, y_pred)
## y_pred
## -1 1
## -1 50 0
## 1 0 50
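For completeness, here is a rough cross-check against the gbm package, a sketch under my own assumptions rather than part of the original post: gbm's "bernoulli" distribution expects labels in {0,1}, so the {-1,1} labels are remapped first, and the shrinkage and tree settings mirror the hand-written run above.
library(gbm)
data01 = data
data01$y = ifelse(data$y == 1, 1, 0)                 # gbm's bernoulli loss wants 0/1 labels
model_gbm = gbm(y ~ x1 + x2, distribution = "bernoulli",
                data = data01, n.trees = 1000, shrinkage = 0.1,
                n.minobsinnode = 1, bag.fraction = 1)
p_hat = predict(model_gbm, data01, n.trees = 1000, type = "response")
table(data$y, ifelse(p_hat > 0.5, 1, -1))            # compare with the table above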