Scikit Learn [1]: Linear Models

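sklearn (scikit-learn) is a widely used machine learning library for Python. The content of this post comes from the official scikit-learn website; I have simply typed the code out myself in order to study the material systematically and follow its logic. Working through all of it turns out to be quite rewarding, and I will continue with chapters 2 and 3 later…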

1. Linear Models

1.1 LM

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Inspect the fitted coefficients

clf.coef_

array([ 0.5,  0.5])
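As a rough cross-check of these coefficients, the same fit can be reproduced with plain NumPy. This is only a sketch: it assumes that, because the two feature columns are identical, the reported result corresponds to the minimum-norm least-squares solution, which is what `np.linalg.lstsq` returns. The names `X_aug` and `solution` are illustrative.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [2, 2]], dtype=float)
y = np.array([0, 1, 2], dtype=float)

# Prepend a column of ones so the intercept is estimated with the weights.
X_aug = np.column_stack([np.ones(len(X)), X])

# lstsq returns the minimum-norm solution when columns are collinear,
# as they are here (the two features are identical).
solution, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
intercept, coef = solution[0], solution[1:]
print(coef)
```

Because the columns are collinear, the weight is split evenly between the two features rather than assigned to one of them.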

1.2 Ridge Regression

from sklearn import linear_model
clf = linear_model.Ridge(alpha=0.5)
clf.fit([[0, 0], [0, 0], [1, 1]], [0, 0.1, 1])

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,   normalize=False, solver='auto', tol=0.001)

View the coefficients

clf.coef_

array([ 0.34545455,  0.34545455])

View the intercept

clf.intercept_

0.13636363636363641
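The coefficients and intercept above can be reproduced from the ridge closed form. The sketch below assumes the intercept is left unpenalized, which is handled here by centering the data before solving; the variable names are illustrative.

```python
import numpy as np

X = np.array([[0, 0], [0, 0], [1, 1]], dtype=float)
y = np.array([0, 0.1, 1])
alpha = 0.5

# Center the data so that the intercept is not penalized.
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# Solve (Xc^T Xc + alpha * I) w = Xc^T yc.
coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
intercept = y_mean - X_mean @ coef
print(coef, intercept)
```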

Ridge Regression with cross-validation

from sklearn import linear_model
clf = linear_model.RidgeCV(alphas=[0.1, 0.5, 1])
clf.fit([[0, 0], [0, 0], [1, 1]], [0, .1, 1])

RidgeCV(alphas=[0.1, 0.5, 1], cv=None, fit_intercept=True, gcv_mode=None,    normalize=False, scoring=None, store_cv_values=False)

View the $\alpha$ selected by cross-validation

clf.alpha_

0.10000000000000001
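With `cv=None`, RidgeCV uses an efficient form of leave-one-out cross-validation. The sketch below is a brute-force version of the same idea, not the library's actual computation: it refits a closed-form ridge model with each sample held out in turn and keeps the alpha with the smallest mean squared error. The helper names `ridge_fit` and `loo_error` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge fit with an unpenalized intercept."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - X_mean @ coef

def loo_error(X, y, alpha):
    """Mean squared leave-one-out prediction error for one alpha."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, intercept = ridge_fit(X[keep], y[keep], alpha)
        errors.append((y[i] - (X[i] @ coef + intercept)) ** 2)
    return np.mean(errors)

X = np.array([[0, 0], [0, 0], [1, 1]], dtype=float)
y = np.array([0, 0.1, 1])
alphas = [0.1, 0.5, 1]

scores = {alpha: loo_error(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
print(best_alpha)
```

On this tiny data set the brute-force search picks the same alpha as the output above.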

1.3 LASSO

clf = linear_model.Lasso(alpha=0.1)
clf.fit([[0, 0], [1, 1]], [0, 1])

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,   normalize=False, positive=False, precompute=False, random_state=None,   selection='cyclic', tol=0.0001, warm_start=False)

Predict for a new sample

clf.predict([[1,1]])

array([ 0.8])
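To see where the prediction of 0.8 comes from, here is a minimal coordinate-descent sketch for the Lasso objective (1/(2n))·||y − Xw − b||² + α·||w||₁. It follows the textbook algorithm and is not scikit-learn's implementation; `soft_threshold` and `lasso_cd` are hypothetical helper names.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, alpha, n_iter=100):
    """Cyclic coordinate descent with an unpenalized intercept."""
    n_samples, n_features = X.shape
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    coef = np.zeros(n_features)
    for _ in range(n_iter):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            residual = yc - Xc @ coef + Xc[:, j] * coef[j]
            rho = Xc[:, j] @ residual / n_samples
            scale = Xc[:, j] @ Xc[:, j] / n_samples
            coef[j] = soft_threshold(rho, alpha) / scale if scale > 0 else 0.0
    return coef, y_mean - X_mean @ coef

X = np.array([[0, 0], [1, 1]], dtype=float)
y = np.array([0, 1], dtype=float)
coef, intercept = lasso_cd(X, y, alpha=0.1)
prediction = np.array([1, 1]) @ coef + intercept
print(coef, prediction)
```

The first feature absorbs the whole signal and the second is thresholded to (numerically) zero, which is the sparsity the L1 penalty is meant to produce.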

1.4 Elastic Net

http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_and_elasticnet.html#example-linear-model-plot-lasso-and-elasticnet-py

A signal-processing example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# generate some sparse data to play with
np.random.seed(42)
n_sample, n_features = 50, 200
X = np.random.randn(n_sample, n_features)
coef = 3 * np.random.randn(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)
coef[inds[10:]] = 0  # keep only 10 non-zero coefficients
y = np.dot(X, coef)
y

array([  1.85454808,  -1.88185275,  -3.22753722,  -1.29637962,        12.21021156,   3.09834518,   0.83311376,  12.60558   ,         3.29040286,   1.41273462,  -9.24393601,   0.82443959,        -5.38482466,   8.74127838, -16.23212532,   7.61936206,       -10.89723881,  -5.88004681,  -1.75512379,  -0.46345128,        -5.8950359 ,  17.7640346 ,  -3.97835215, -19.89832756,        -2.75872327,  -7.07454408,   3.74977501,  20.06851876,        -4.15553144,  -8.24155577,   4.26803734,   3.33670968,        13.20475772,   2.44885748, -10.51464129, -13.90984428,        -5.76433803,  -1.73589121, -18.96316779,  -9.77324314,         8.38704103,  -9.83929643,  15.54698292,   0.15803178,        -6.68972473,  -3.45724035, -10.4518149 ,  -5.885115  ,        -6.83404273,  -0.98061547])

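# add noise
# note: this call passes n_sample as the mean (loc) and returns one scalar,
# so a constant offset is added; per-sample noise would need size=n_sample
y += 0.01 * np.random.normal((n_sample))
# split data into train and test sets (// keeps the indices integers)
n_samples = X.shape[0]
X_train, y_train = X[:n_sample // 2], y[:n_sample // 2]
X_test, y_test = X[n_sample // 2:], y[n_sample // 2:]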

Lasso for this data

# Lasso
from sklearn.linear_model import Lasso
alpha = 0.1
lasso = Lasso(alpha=alpha)
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_pred_lasso = r2_score(y_test, y_pred_lasso)
print(lasso)

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,   normalize=False, positive=False, precompute=False, random_state=None,   selection='cyclic', tol=0.0001, warm_start=False)

print("r^2 on test data: %f" % r2_pred_lasso)

r^2 on test data: 0.384710

Elastic Net for this data

from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=alpha, l1_ratio=0.7)
ypred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, ypred_enet)
print(enet)

ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.7,      max_iter=1000, normalize=False, positive=False, precompute=False,      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

print("r^2 on test data : %f" % r2_score_enet)

r^2 on test data : 0.240176
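The only difference between the two models is the penalty. As a sketch, the function below evaluates the Elastic Net objective as documented by scikit-learn, (1/(2n))·||y − Xw||² + α·l1_ratio·||w||₁ + 0.5·α·(1 − l1_ratio)·||w||²; setting `l1_ratio=1` recovers the Lasso objective. The function name and the demo data are illustrative, not part of the original example.

```python
import numpy as np

def enet_objective(X, y, w, alpha, l1_ratio):
    """Elastic Net objective: squared loss plus a mixed L1/L2 penalty."""
    n_samples = X.shape[0]
    squared_loss = np.sum((y - X @ w) ** 2) / (2 * n_samples)
    l1_penalty = alpha * l1_ratio * np.sum(np.abs(w))
    l2_penalty = 0.5 * alpha * (1 - l1_ratio) * np.sum(w ** 2)
    return squared_loss + l1_penalty + l2_penalty

rng = np.random.RandomState(0)
X_demo = rng.randn(20, 3)
w_demo = np.array([1.0, 0.0, -2.0])
y_demo = X_demo @ w_demo

# With a perfect fit the loss term vanishes and only the penalties remain.
lasso_like = enet_objective(X_demo, y_demo, w_demo, alpha=0.1, l1_ratio=1.0)
mixed = enet_objective(X_demo, y_demo, w_demo, alpha=0.1, l1_ratio=0.7)
print(lasso_like, mixed)
```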

%pylab inline
plt.plot(enet.coef_, label='Elastic net coefficients')
plt.plot(lasso.coef_, label='Lasso coefficients')
plt.plot(coef, '--', label='original coefficients')
plt.legend(loc='best')
plt.title("Lasso R^2: %f, Elastic Net R^2: %f"
          % (r2_pred_lasso, r2_score_enet))
plt.show()

Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['clf']
`%matplotlib` prevents importing * from pylab and numpy

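1.5 Multi-task Lasso

Compared with the ordinary Lasso, the response y in the multi-task Lasso is a 2D array of shape (n_sample, n_task), which can be treated as panel data:

n_sample is the number of observations
n_task is the number of tasks, i.e. how many sets of observations there are. For example, with 30 days of data in a month and 10 stocks giving 10 sets of observations, the regression can be run with multi-task Lasso

The multi-task Lasso tries to find features that are useful for all tasks: a feature is selected either for every task or for none.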
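This preference for shared features comes from the mixed L2/L1 penalty that the multi-task Lasso uses: coefficients are grouped by feature, and each group is penalized by its L2 norm across tasks. A small sketch of that norm, assuming the coefficient array has shape (n_tasks, n_features) as `MultiTaskLasso.coef_` does (the function name is hypothetical):

```python
import numpy as np

def l21_norm(coef):
    """Sum over features of the L2 norm of that feature's coefficients
    across tasks; coef has shape (n_tasks, n_features)."""
    return np.sum(np.sqrt(np.sum(coef ** 2, axis=0)))

# Two tasks, three features: feature 0 is used by both tasks,
# feature 1 by neither, feature 2 by the second task only.
coef_demo = np.array([[3.0, 0.0, 0.0],
                      [4.0, 0.0, 1.0]])
print(l21_norm(coef_demo))  # per-feature norms are 5, 0 and 1
```

A feature contributes nothing to the penalty only when its coefficient is zero for every task, which is why whole columns tend to be switched on or off together.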

%pylab inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import MultiTaskLasso, Lasso

rng = np.random.RandomState(42)

# Build a 2D coefficient matrix from randomly generated values
n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5
coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2 * np.pi, n_tasks)
for k in range(n_relevant_features):
    coef[:, k] = np.sin((1. + rng.randn(1)) * times + 3 * rng.randn(1))

X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)

# One independent Lasso per task versus a single multi-task fit
coef_lasso = np.array([Lasso(alpha=0.5).fit(X, y).coef_ for y in Y.T])
coef_multi_task_lasso_ = MultiTaskLasso(alpha=1.).fit(X, Y).coef_

# Plot which coefficients are non-zero under each model
fig = plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
plt.spy(coef_lasso)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'Lasso')
plt.subplot(1, 2, 2)
plt.spy(coef_multi_task_lasso_)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'MultiTaskLasso')
fig.suptitle('Coefficient non-zero location')

# Compare the recovered coefficients of one feature across tasks
feature_to_plot = 0
plt.figure()
plt.plot(coef[:, feature_to_plot], 'k', label='Ground truth')
plt.plot(coef_lasso[:, feature_to_plot], 'g', label='Lasso')
plt.plot(coef_multi_task_lasso_[:, feature_to_plot],
         'r', label='MultiTaskLasso')
plt.legend(loc='upper center')
plt.axis('tight')
plt.ylim([-1.1, 1.1])
plt.show()

Populating the interactive namespace from numpy and matplotlib

1.6 Least Angle Regression

Least angle regression (LARS) is a method for high-dimensional regression problems, developed by Efron et al.…

Functions:

Lars

lars_path
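A minimal usage sketch for `lars_path` on synthetic data follows. The data, the seed, and the variable names are made up for illustration and are not from the original post.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.RandomState(0)
n_samples, n_features = 50, 8
X_demo = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[[1, 4]] = [2.0, -3.0]
y_demo = X_demo @ true_coef + 0.01 * rng.randn(n_samples)

# method='lar' gives plain least angle regression; 'lasso' gives the Lasso path.
alphas, active, coefs = lars_path(X_demo, y_demo, method='lar')

print(coefs.shape)  # one column of coefficients per point on the path
print(active)       # indices of the active features at the end of the path
```

The path starts from an all-zero coefficient vector and adds features one at a time, so the columns of `coefs` show the order in which features enter the model.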

1.7 LARS Lasso

A Lasso model implemented with LARS

Computing the sparsity (regularization) path across parameter values

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import datasets

diabetes = datasets.load_diabetes()
X = diabetes.data
Y = diabetes.target

print("Computing regularization path using the LARS …")
alphas, _, coefs = linear_model.lars_path(X, Y, method='lasso', verbose=True)

# Normalize the L1 norm of the coefficients to [0, 1] for the x-axis
xx = np.sum(np.abs(coefs.T), axis=1)
xx /= xx[-1]

plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()

Computing regularization path using the LARS ….

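1.8 Orthogonal Matching Pursuit (OMP)

Orthogonal matching pursuit can be viewed as an implementation of L0 regularization.

The L0 "norm" counts the non-zero coefficients (it is not the largest coefficient; that would be the L-infinity norm), so L0 regularization limits how many coefficients may be non-zero.

OMP can also be stated in terms of a target error rather than a fixed number of non-zero coefficients:

$$\arg\min_{\gamma} \|\gamma\|_0 \quad \text{subject to} \quad \|y - X\gamma\|_2^2 \leq \text{tol}$$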

Functions used

OrthogonalMatchingPursuit and orthogonal_mp
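To make the greedy idea concrete, here is a small NumPy sketch of the textbook OMP loop. It is not scikit-learn's implementation; the function name `omp` and the toy dictionary are assumptions made for illustration.

```python
import numpy as np

def omp(X, y, n_nonzero):
    """Greedy OMP (n_nonzero >= 1): pick the column most correlated with
    the residual, then refit least squares on all columns picked so far."""
    residual = y.copy()
    support = []
    coef = np.zeros(X.shape[1])
    for _ in range(n_nonzero):
        correlations = np.abs(X.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        solution, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ solution
    coef[support] = solution
    return coef

rng = np.random.RandomState(0)
# Orthonormal columns make exact recovery easy to see in this toy case.
Q, _ = np.linalg.qr(rng.randn(20, 5))
true_coef = np.array([0.0, 2.0, 0.0, -3.0, 0.0])
recovered = omp(Q, Q @ true_coef, n_nonzero=2)
print(recovered)
```

With `n_nonzero=2` the loop stops after two columns, so the returned vector has at most two non-zero entries, which is the L0 constraint in action.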

1.9.1 Bayesian Ridge Regression

function: BayesianRidge

from sklearn import linear_model
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
Y = [0, 1, 2, 3]
clf = linear_model.BayesianRidge()
clf.fit(X, Y)

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,       fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,       normalize=False, tol=0.001, verbose=False)

clf.predict([[1,0]])

array([ 0.50000013])

clf.coef_

array([ 0.49999993,  0.49999993])

1.9.2 Automatic Relevance Determination - ARD

Similar to Bayesian Ridge Regression, but it places a different prior on the coefficients.

function: ARDRegression

1.10 Logistic Regression

This part will be covered later, in the material on stochastic gradient descent (SGD).

1.11 Polynomial regression

Polynomial regression

function: sklearn.preprocessing

from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.arange(6).reshape(3, 2)
X

array([[0, 1],
       [2, 3],
       [4, 5]])

poly = PolynomialFeatures(degree=2)
poly.fit_transform(X)

array([[ 1,  0,  1,  0,  0,  1],
       [ 1,  2,  3,  4,  6,  9],
       [ 1,  4,  5, 16, 20, 25]])
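For two input columns a and b, the degree-2 expansion above consists of the columns 1, a, b, a², a·b, b². A quick NumPy sketch that builds the same matrix by hand (the names `a`, `b`, and `manual` are illustrative):

```python
import numpy as np

X = np.arange(6).reshape(3, 2)
a, b = X[:, 0], X[:, 1]

# Columns in the order 1, a, b, a^2, a*b, b^2.
manual = np.column_stack([np.ones_like(a), a, b, a ** 2, a * b, b ** 2])
print(manual)
```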

A tool such as Pipeline can be used to chain polynomial feature generation with linear regression

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
# fit a cubic polynomial to exact cubic data
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
model = model.fit(x[:, np.newaxis], y)
model.named_steps['linear'].coef_

array([ 3., -2.,  1., -1.])
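The recovered coefficients can be double-checked without scikit-learn. The sketch below assumes that, for a single input column, the degree-3 polynomial features are simply 1, x, x², x³, and solves the resulting least-squares problem with NumPy; `design` is an illustrative name.

```python
import numpy as np

x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3

# Columns 1, x, x^2, x^3 for a single input column.
design = np.vander(x, N=4, increasing=True)
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 6))
```

Since the data are generated from an exact cubic, the least-squares fit reproduces the generating coefficients in increasing order of degree.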
