
GBM: Basic Theory and Derivation


With some free time on my hands (right, and what about those four deadlines...), let me work through how GBM (Gradient Boosting Method) is implemented under the squared loss and the log loss. First, some links: wiki-gbm is the Wikipedia article, which gives a fairly intuitive overview of the ideas and theory, and Teacher Xiao's blog provides an implementation based on rpart (as always, you may need to get around the firewall to reach it).


GBM is a boosting method that improves a model iteratively. Yes, a model! In principle, any model with an explicit expression can be boosted with GBM. The best-known instance is the gbm package in R; what is used most often in practice is GBDT (Gradient Boosting Decision Tree), i.e. boosting decision trees with GBM.



The GBM Algorithm


The GBM algorithm is as follows:

Initialize the model with a constant value $F_0(x)$:

$$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma)$$

For $m = 1$ to $M$:

Compute the "pseudo-residuals":

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} \quad \mbox{for } i=1,\ldots,n$$

Fit a new base learner $h_m(x)$ to the pseudo-residuals, i.e. train it on the training set $\{(x_i, r_{im})\}_{i=1}^n$.

Determine the multiplier $\gamma_m$ by solving the following optimization problem:

$$\gamma_m = \underset{\gamma}{\operatorname{arg\,min}} \sum_{i=1}^n L\left(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\right)$$

Update the model:

$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

Output the final model $F_M(x)$.

A few brief remarks on the algorithm:

  • The final estimate is $$F_M(x)$$, which can be viewed as an ensemble of weak learners:
    • $$F_M(x) = \sum_{m=0}^{M} \gamma_m h_m(x)$$
    • $$F_0(x)$$ plays the role of the model's constant term
  • The overall idea is similar to the Newton-Raphson method: the loss function is reduced step by step along the negative gradient until an optimum is reached
  • Concretely, the loss $$L(y,F)$$ is driven down by repeatedly fitting new learners to the "pseudo-residuals"
  • The line-search for $\gamma_m$ is often skipped in practice and replaced by a fixed shrinkage parameter (see the minimal sketch right after this list)
  • Next, I will walk through this procedure under the squared loss (square loss) and the log loss (log-loss)
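Before moving to the specific losses, here is a minimal, loss-agnostic sketch of that loop in R. It only illustrates the structure above: `loss_gradient` and `fit_base_learner` are hypothetical placeholder names (not from any package), standing in for the pseudo-residual computation and whatever weak learner you choose.

## Minimal sketch of the generic GBM loop (placeholder names, assumptions):
## loss_gradient(y, f) returns the pseudo-residuals -dL/dF evaluated at f,
## fit_base_learner(x, r) fits any weak learner h_m to (x_i, r_im).
gbm_generic = function(x, y, loss_gradient, fit_base_learner,
                       shrinkage = 0.01, iter = 100) {
  f = rep(0, length(y))   # F_0, taken as 0 here purely for illustration
  learners = list()
  for (m in 1:iter) {
    r = loss_gradient(y, f)                      # pseudo-residuals r_im
    learners[[m]] = fit_base_learner(x, r)       # h_m fitted to (x_i, r_im)
    f = f + shrinkage * predict(learners[[m]])   # gamma_m replaced by shrinkage
  }
  list(learners = learners, shrinkage = shrinkage, iter = iter)
}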



      Model Derivation



      Squared Loss


      The squared loss is also known as the Gaussian loss. It is defined as $$L(y_i,f(x_i)) = (y_i-f(x_i))^2/2$$ and, when summed over observations, weights are sometimes included: $$\frac{1}{2\sum \omega_i} \sum\omega_i(y_i-f(x_i))^2$$


      where $$\omega_i$$ are the observation weights (all set to 1 when no weights are specified). If $$f(x)$$ is not centered, it is replaced by $$f(x) + o_i$$, where $o_i$ is an offset term.
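      As a small illustration (my own, not used by the later code), the weighted loss above is a one-liner in R; with all $\omega_i = 1$ and $o_i = 0$ it reduces to half the mean squared error:

      gaussian_loss = function(y, f, w = rep(1, length(y)), o = 0) {
        ## (1 / (2 * sum(w))) * sum(w * (y - f - o)^2)
        sum(w * (y - f - o)^2) / (2 * sum(w))
      }
      gaussian_loss(c(1, 0, 1), c(0.8, 0.1, 0.6))  # unweighted, centered case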



      Computing the Initial Value


      Following the algorithm above, the initial value is: $$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) = \underset{f_0}{\arg\min} \frac{1}{2\sum \omega_i}\sum \omega_i(y_i-f_0(x_i)-o_i)^2$$


      Differentiate the loss with respect to $f_0$ and set the derivative to zero:


      $$\frac{\partial}{\partial f_0(x_i)} \frac{1}{2\sum \omega_i}\sum \omega_i(y_i-f_0(x_i)-o_i)^2$$
      $$= -\frac{1}{\sum \omega_i}\sum \omega_i(y_i-f_0(x_i) - o_i)$$
      $$= 0$$


      $$\sum \omega_i(y_i-f_0(x_i) - o_i) = 0$$ $$f_0(x) = \frac{\sum \omega_i(y_i-o_i)}{\sum \omega_i}$$


      In particular, if the model is centered ($$o_i = 0$$) and no weights are set ($$\omega_i = 1$$), this reduces to


      $$f_0(x) = \frac{\sum y_i}{n} = \overline{y}$$
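      A quick numerical sanity check of this result (my own illustration): minimizing the unweighted squared loss over a constant with optimize() recovers the sample mean:

      set.seed(1)
      y = rnorm(100, mean = 2)
      sq_loss = function(gamma) sum((y - gamma)^2) / 2
      optimize(sq_loss, interval = range(y))$minimum  # ~ mean(y)
      mean(y)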




      Computing the Pseudo-Residuals


      For this part I work directly with the default unweighted case:


      $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$ $$ = -\left[\frac{\partial \, (y_i-f(x_i))^2/2}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$ $$ = y_i - f_{m-1}(x_i)$$
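      So under squared loss the pseudo-residual is simply the ordinary residual $y_i - f_{m-1}(x_i)$, which is why the code below refits trees to `y - f`. A finite-difference check (my own illustration) confirms this:

      y = 1; f = 0.3; eps = 1e-6
      L = function(f) (y - f)^2 / 2
      analytic = y - f                                   # pseudo-residual
      numeric  = -(L(f + eps) - L(f - eps)) / (2 * eps)  # -dL/df numerically
      c(analytic, numeric)                               # both ~ 0.7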




      The Remaining Steps



      • Fit a new model (e.g. an rpart tree) to the pseudo-residuals $$r_{im}$$ using all the data

      • Update the model

        • The line-search step is dropped; if it were kept, it would amount to a Newton-Raphson-style approximation

        • $$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

        • A shrinkage parameter is used in place of $$\gamma_m$$





      Code Implementation


      Following Teacher Xiao's approach, I wrote this code by hand; the base learners are fitted with the rpart package.



      • Model training function


      library(rpart)
      gbm_gaussian = function(data, shrinkage = 0.01, iter = 100, ...){
        learner = list()
        ## Initial value: f_0 = mean of y
        f = mean(data$y == 1)
        ## pseudo-residuals
        data$dy = data$y - f
        for(i in 1:iter){
          learner[[i]] = rpart(dy ~ x1 + x2, data = data,
                               control = rpart.control(maxdepth = 1, cp = 0))
          ## Improve model: new predicted value for each observation
          f = f + shrinkage * predict(learner[[i]])
          ## update pseudo-residuals
          data$dy = data$y - f
        }
        result = list('learner' = learner,
                      'shrinkage' = shrinkage,
                      'iter' = iter)
        return(result)
      }


      • A prediction function for the model above:


      predict_gbm_gaussian = function(model, newdata){
        predict_y = list()
        f = mean(newdata$y == 1)
        n = nrow(newdata)
        iter = model$iter
        shrinkage = model$shrinkage
        learner = model$learner
        predict_y[[1]] = f + shrinkage * predict(learner[[1]], newdata = newdata)

        for(i in 2:iter){
          predict_y[[i]] = predict_y[[i-1]] + shrinkage * predict(learner[[i]], newdata = newdata)
        }
        mse = sapply(predict_y, function(pre_y) sum((pre_y - newdata$y)^2/2))
        result = list('predict_y' = predict_y,
                      'mse' = mse)
        return(result)
      }


      • Try it on some data:

        • Train the model



      data = iris[1:100,1:4]
      data$y = rep(0:1,each = 50)
      names(data)[1:2] = c("x1","x2")
      head(data)

      ##    x1  x2 Petal.Length Petal.Width y
      ## 1 5.1 3.5          1.4         0.2 0
      ## 2 4.9 3.0          1.4         0.2 0
      ## 3 4.7 3.2          1.3         0.2 0
      ## 4 4.6 3.1          1.5         0.2 0
      ## 5 5.0 3.6          1.4         0.2 0
      ## 6 5.4 3.9          1.7         0.4 0

      model = gbm_gaussian(data,0.001,1000)
      objects(model)

      ## [1] "iter"      "learner"   "shrinkage"

      final = predict_gbm_gaussian(model,data)$predict_y[[1000]]
      y_pred = ifelse(final>0.5,1,-1)
      table(data$y, y_pred)

      ##    y_pred
      ##     -1  1
      ##   0 45  5
      ##   1  6 44

      Now try the gbm package:


      library(gbm)

      ## Loading required package: survival
      ## Loading required package: splines
      ## Loading required package: lattice
      ## Loading required package: parallel
      ## Loaded gbm 2.1

      model = gbm(y ~ x1 + x2, distribution = "gaussian",
                  n.minobsinnode = 1,
                  bag.fraction = 1,
                  data = data, n.trees = 1000,
                  shrinkage = 0.001)
      final = predict(model, data, 1000)
      y_pred = ifelse(final > 0.5, 1, -1)
      table(data$y, y_pred)

      ##    y_pred
      ##     -1  1
      ##   0 45  5
      ##   1  6 44




      Derivation under Log Loss


      The log loss is based on the negative log-likelihood $-\log(p(y|x,w))$, but we often see it written in the following form (with $y \in \{-1,1\}$):


      $$L(y,\eta) = -log(p(y|x,w)) = log(1+e^{-y \eta})$$


      The reason is as follows. A classifier can be understood as modelling the log-odds of the response:


      $$f(x_i) = log\frac{p(y=1|x_i,w)}{p(y=-1|x_i,w)} = \eta_i$$


      So, writing $P = p(y=1|x,w)$, we have: $$\eta = \log \frac{P}{1-P} $$
      $$p(y=1|x,w) = P = \frac{e^{\eta}}{1+e^{\eta}}$$


      Similarly, for $p(y=-1|x,w) = 1-P$:
      $$p(y=-1|x,w) = 1- P = 1- \frac{e^{\eta}}{1+e^{\eta}}$$
      $$p(y=-1|x,w) = \frac{e^{-\eta}}{1+e^{-\eta}}$$


      Combining the two cases: $$p(y_i|x_i,w) = \frac{e^{y_i\eta_i}}{1+e^{y_i\eta_i}}$$


      So the log loss can be written as: $$L(y,\eta) = -\log(p(y|x,w)) $$
      $$ = -\log\frac{e^{y\eta}}{1+e^{y\eta}}$$
      $$ = \log\frac{1+e^{y\eta}}{e^{y\eta}}$$
      $$= \log(1+e^{-y \eta})$$
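      A quick numerical check of this identity (my own illustration): for $y = \pm 1$, the negative log-probability computed from the logistic model agrees with $\log(1+e^{-y\eta})$:

      eta = 0.7
      p_pos = exp(eta) / (1 + exp(eta))          # p(y = 1 | x, w)
      c(-log(p_pos),     log(1 + exp(-eta)))     # y =  1
      c(-log(1 - p_pos), log(1 + exp( eta)))     # y = -1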



      Computing the Initial Value


      Following the algorithm above, the initial value is: $$F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^n L(y_i, \gamma) = \underset{f_0}{\arg\min} \sum_{i=1}^n \log(1+e^{-y_i f_0(x)})$$


      Differentiate the loss with respect to $f_0$ and set the derivative to zero. Below, $\overline{y}$ denotes the proportion of observations with $y_i = 1$:


      $$\frac{\partial \sum_{i=1}^n \log(1+e^{-y_i f_0(x)})}{\partial f_0(x)}$$
      $$=\sum\frac{-y_i\exp(-y_i f_0(x))}{1+\exp(-y_i f_0(x))}$$
      $$= \sum_{y_i=1} \frac{-\exp(-f_0(x))}{1+\exp(-f_0(x))} + \sum_{y_i=-1} \frac{\exp(f_0(x))}{1+\exp(f_0(x))}$$
      $$= \overline{y}\,n\,\frac{-1}{1+\exp(f_0(x))} + (1-\overline{y})\,n\,\frac{\exp(f_0(x))}{1+\exp(f_0(x))} = 0 $$
      $$(1-\overline{y})\exp(f_0(x)) - \overline{y} = 0$$
      $$\exp(f_0(x)) = \frac{\overline{y}}{1-\overline{y}}$$
      $$f_0(x) = \log\frac{\overline{y}}{1-\overline{y}}$$
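      Again a small numerical check (my own illustration): with $y \in \{-1,1\}$ and $\overline{y}$ the proportion of positives, optimize() over the summed log loss recovers $\log(\overline{y}/(1-\overline{y}))$:

      y = c(rep(1, 70), rep(-1, 30))                # 70% positive examples
      logloss = function(f0) sum(log(1 + exp(-y * f0)))
      optimize(logloss, interval = c(-5, 5))$minimum
      log(0.7 / (1 - 0.7))                          # log(ybar / (1 - ybar)) ~ 0.847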




      Computing the Pseudo-Residuals


      $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$$ $$ = -\left[\frac{\partial \log(1+e^{-y_i f(x_i)})}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$$ $$ = \frac{y_i\exp(-y_i f(x_i))}{1+\exp(-y_i f(x_i))}$$ $$ = \frac{y_i}{1+\exp(y_i f(x_i))}$$
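      This is exactly the quantity that Teacher Xiao's code below stores in `data$dy`. A finite-difference check (my own illustration) confirms the formula:

      y = -1; f = 0.4; eps = 1e-6
      L = function(f) log(1 + exp(-y * f))
      analytic = y / (1 + exp(y * f))                    # pseudo-residual formula
      numeric  = -(L(f + eps) - L(f - eps)) / (2 * eps)  # -dL/df numerically
      c(analytic, numeric)                               # both ~ -0.599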




      Code Implementation


      I will not rewrite this part myself; below is Teacher Xiao's code, taken from his blog.


      library(ggplot2)

      ## Warning: package 'ggplot2' was built under R version 3.1.3

      # read data
      data = subset(iris, Species != 'virginica', select = c(1,2,5))
      data$y = ifelse(data$Species == 'setosa', 1, -1)
      data$Species = NULL
      names(data)[1:2] = c('x1','x2')
      head(data)

      ##    x1  x2 y
      ## 1 5.1 3.5 1
      ## 2 4.9 3.0 1
      ## 3 4.7 3.2 1
      ## 4 4.6 3.1 1
      ## 5 5.0 3.6 1
      ## 6 5.4 3.9 1

      p = ggplot(data, aes(x1, x2, color = factor(y))) +
        geom_point()
      print(p)


      # train boosting tree for classification
      GBC = function(data, rate = 0.1, iter = 100){
        library(rpart)
        max_depth = 1
        learner = list()
        mu = mean(data$y == 1)
        # start with an initial model
        # mu = p(y=1) -> f = wx = log(mu/(1-mu))
        f = log(mu/(1-mu))
        data$dy = data$y/(1 + exp(data$y * f)) # dy is the negative gradient of the log loss function
        for (i in 1:iter) {
          # use a weak model to improve
          learner[[i]] = rpart(dy ~ x1 + x2, data = data,
                               control = rpart.control(maxdepth = max_depth, cp = 0))
          # improve model
          f = f + rate * predict(learner[[i]])
          # modify dy
          data$dy = data$y/(1 + exp(data$y * f))
        }
        result = list('learner' = learner,
                      'rate' = rate,
                      'iter' = iter)
        return(result)
      }

      model = GBC(data,rate=0.1,iter=1000)
      objects(model)

      ## [1] "iter"    "learner" "rate"

      # predict function
      predict_GBC = function(data, model){
        predict_y = list()
        mu = mean(data$y == 1)
        f = log(mu/(1-mu))
        n = nrow(data)
        iter = model$iter
        rate = model$rate

        predict_y[[1]] = rep(f, n)
        learner = model$learner
        for (i in 2:iter) {
          predict_y[[i]] = predict_y[[i-1]] + rate * predict(learner[[i-1]], newdata = data)
        }
        mse = sapply(predict_y, function(pre_y) sum(log(1 + exp(-data$y * pre_y)))) # logistic loss function
        result = list('predict_y' = predict_y,
                      'mse' = mse)
        return(result)
      }

      # predict data
      predict_train = predict_GBC(data,model=model)
      objects(predict_train)

      ## [1] "mse"       "predict_y"

      plot(predict_train$mse, type = 'l')


      final = predict_train$predict_y[[1000]]
      y_p = 1/(1+exp(-final))
      y_pred = ifelse(y_p>0.5,1,-1)
      table(data$y, y_pred)

      ##      y_pred
      ##       -1  1
      ##   -1  50  0
      ##   1    0 50


