不写R包的分析师不是好全栈

Python,Sklearn[1]基本线性模型

    JsPy&Others









Scikit Learn[1]线性模型










Python的sklearn是Python中应用广泛的机器学习库.本博文内容来源于scikit-learn官网,这里我只是将代码打一遍,系统的学习下期中的内容,并看下逻辑.其实,如果真的能全部走一遍还是蛮有收获的,后续我会继续第二章,第三章的学习…


1.线性模型

1.1 LM







In [5]:



from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0,0],[1,1],[2,2]],[0,1,2])







Out[5]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)








  • 参数的查看








In [6]:



clf.coef_







Out[6]:

array([ 0.5,  0.5])







1.2Ridge Regression







In [7]:



from sklearn import linear_model
clf = linear_model.Ridge(alpha = 0.5)
clf.fit([[0,0],[0,0],[1,1]],[0,0.1,1])







Out[7]:

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, solver='auto', tol=0.001)








  • 查看参数








In [8]:



clf.coef_







Out[8]:

array([ 0.34545455,  0.34545455])








  • 查看截距








In [9]:



clf.intercept_







Out[9]:

0.13636363636363641








  • 交叉验证的Ridge Regression








In [10]:



from sklearn import linear_model
clf = linear_model.RidgeCV(alphas = [0.1, 0.5, 1])
clf.fit([[0,0],[0,0],[1,1]],[0,.1,1])







Out[10]:

RidgeCV(alphas=[0.1, 0.5, 1], cv=None, fit_intercept=True, gcv_mode=None,
normalize=False, scoring=None, store_cv_values=False)








  • 查看交叉验证所确定的$\alpha$








In [11]:



clf.alpha_







Out[11]:

0.10000000000000001







1.3 LASSO







In [12]:



clf = linear_model.Lasso(alpha = 0.1)
clf.fit([[0,0],[1,1]],[0,1])







Out[12]:

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)








  • 预测新变量








In [13]:



clf.predict([[1,1]])







Out[13]:

array([ 0.8])





In [15]:



import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import r2_score

###################################
# generate some sparse data to play with

np.random.seed(42)

n_sample, n_features = 50,200
X = np.random.randn(n_sample, n_features)
coef = 3 * np.random.randn(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)
coef[inds[10:]] = 0
y = np.dot(X,coef)
y







Out[15]:

array([  1.85454808,  -1.88185275,  -3.22753722,  -1.29637962,
12.21021156, 3.09834518, 0.83311376, 12.60558 ,
3.29040286, 1.41273462, -9.24393601, 0.82443959,
-5.38482466, 8.74127838, -16.23212532, 7.61936206,
-10.89723881, -5.88004681, -1.75512379, -0.46345128,
-5.8950359 , 17.7640346 , -3.97835215, -19.89832756,
-2.75872327, -7.07454408, 3.74977501, 20.06851876,
-4.15553144, -8.24155577, 4.26803734, 3.33670968,
13.20475772, 2.44885748, -10.51464129, -13.90984428,
-5.76433803, -1.73589121, -18.96316779, -9.77324314,
8.38704103, -9.83929643, 15.54698292, 0.15803178,
-6.68972473, -3.45724035, -10.4518149 , -5.885115 ,
-6.83404273, -0.98061547])




In [17]:



## add noise
y += 0.01 * np.random.normal((n_sample))

## Split data in train and test set
n_samples = X.shape[0]
X_train, y_train = X[:n_sample / 2], y[:n_sample / 2]
X_test, y_test = X[n_sample / 2:], y[n_sample / 2:]









  • Lasso for this data








In [18]:



#####Lasso
from sklearn.linear_model import Lasso

alpha = 0.1
lasso = Lasso(alpha = alpha)

y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_pred_lasso = r2_score(y_test,y_pred_lasso)
print(lasso)









Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)





In [20]:



print("r^2 on test data: %f" % r2_pred_lasso)









r^2 on test data: 0.384710









  • Elastic Net for this data








In [23]:



from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha= alpha,l1_ratio = 0.7)

ypred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, ypred_enet)
print(enet)









ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.7,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)





In [25]:



print("r^2 on test data : %f" % r2_score_enet)









r^2 on test data : 0.240176





In [27]:



%pylab inline

plt.plot(enet.coef, label = 'Elastic net coeffcients')
plt.plot(lasso.coef, label = 'Lasso coeffcients')
plt.plot(coef, "–", label = 'original coeffcients')
plt.legend(loc = "best")
plt.title("Lasso R^2: %f, Elastic Net R^2: %f"
% (r2_pred_lasso,r2_score_enet))
plt.show()









Populating the interactive namespace from numpy and matplotlib




WARNING: pylab import has clobbered these variables: ['clf']
`%matplotlib` prevents importing * from pylab and numpy










1.5 multi-task lasso

与传统的lasso相比,muti-task的lasso的因变量y是个二维变量(n_sample,n_task) 应看做是panel data,


n_sample代表n个观测值
n_task代表task次观测,比如,一个月有30天的数据,10支股票观测10次,用multi-task来进行回归


multi-task lasso希望可以找到对于所有task都有用的变量








In [6]:



%pylab inline

import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import MultiTaskLasso, Lasso

rng = np.random.RandomState(42)

## 通过随机生成的方式建立一个2D的参数
n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5

coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2 np.pi, n_tasks)

for k in range(n_relevant_features):
coef[:, k] = np.sin((1. + rng.randn(1)) times + 3 * rng.randn(1))

X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)

coeflasso = np.array([Lasso(alpha = 0.5).fit(X, y).coef for y in Y.T])
coef_multi_tasklasso = MultiTaskLasso(alpha = 1.).fit(X, Y).coef_



fig = plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
plt.spy(coef_lasso)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'Lasso')
plt.subplot(1, 2, 2)
plt.spy(coef_multi_tasklasso)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'MultiTaskLasso')
fig.suptitle('Coefficient non-zero location')

feature_to_plot = 0
plt.figure()
plt.plot(coef[:, feature_to_plot], 'k', label='Ground truth')
plt.plot(coef_lasso[:, feature_to_plot], 'g', label='Lasso')
plt.plot(coef_multi_tasklasso[:, feature_to_plot],
'r', label='MultiTaskLasso')
plt.legend(loc='upper center')
plt.axis('tight')
plt.ylim([-1.1, 1.1])
plt.show()









Populating the interactive namespace from numpy and matplotlib












1.6 Least Angle Regression

最小角度回归(LAR)是用于解决高维回归问题的一个方法,Efron发明的…


funciton:



  • Lars

  • lars_path











1.7 LARS Lasso

一个基于LARS实现的Lasso模型


计算不同参数的稀疏路径








In [13]:



import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn import datasets

diabetes = datasets.loaddiabetes()
X = diabetes.data
Y = diabetes.target

print("Computing regularization path using the LARS …")
alphas, , coefs = linear_model.lars_path(X, Y, method = 'lasso',verbose = True)

xx = np.sum(np.abs(coefs.T), axis = 1)
xx /= xx[-1]

plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()









Computing regularization path using the LARS …
.










1.8 Orthogonal Matching Pursuit(OMP)

正交匹配追踪



  • 看做L0正则化的实现

  • L0正则化可以看做最小化最大参数(L0就是最大的那个参数)

  • 等价于


$$argmin||\gamma||_0\quad subject\quad to ||y-X\gamma||^2_2 \leq tol$$

  • 使用的函数

  • OrthogonalMatchingPursuitorthogonal_mp











1.9.1 Bayesian Ridge Regression


  • function: BayesianRidge








In [14]:



from sklearn import linear_model
X = [[0,0],[1,1],[2,2],[3,3]]
Y = [0,1,2,3]
clf = linear_model.BayesianRidge()
clf.fit(X,Y)







Out[14]:

BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, compute_score=False, copy_X=True,
fit_intercept=True, lambda_1=1e-06, lambda_2=1e-06, n_iter=300,
normalize=False, tol=0.001, verbose=False)




In [15]:



clf.predict([[1,0]])







Out[15]:

array([ 0.50000013])




In [16]:



clf.coef_







Out[16]:

array([ 0.49999993,  0.49999993])







1.9.2 Automatic Relevance Determination - ARD


  • 与Bayesian Ridge Regression相似

  • 给出一个新的先验

  • ARDRegression











1.10 Logistic Regression

这部分会在后期关于随机梯度下降(SGD)的内容中给出











1.11 Polynomial regression

多项式回归



  • function: sklearn.processing








In [17]:



from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.arange(6).reshape(3,2)
X







Out[17]:

array([[0, 1],
[2, 3],
[4, 5]])




In [18]:



ploy = PolynomialFeatures(degree = 2)
ploy.fit_transform(X)







Out[18]:

array([[ 1,  0,  1,  0,  0,  1],
[ 1, 2, 3, 4, 6, 9],
[ 1, 4, 5, 16, 20, 25]])








  • 可以使用pipeline这样的工具来将多项式生成与线性回归组合起来








In [23]:



from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
model = Pipeline([('poly',PolynomialFeatures(degree = 3)),
('linear',LinearRegression(fit_intercept=False))])

x = np.arange(5)
y = 3 - 2 x + x * 2 - x 3
model = model.fit(x[:,np.newaxis],y)
model.namedsteps['linear'].coef







Out[23]:

array([ 3., -2.,  1., -1.])




page PV:  ・  site PV:  ・  site UV: