Python, Sklearn [1]: Basic Linear Models
sklearn is one of the most widely used machine learning libraries in Python. This post is based on the scikit-learn official documentation; I typed out the code myself to work through the material systematically and follow its logic. Going through all of it is genuinely rewarding, and I will continue with chapters 2 and 3 later…
1. Linear Models¶
1.1 Linear Regression (LM)¶
In [5]:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[0,0],[1,1],[2,2]],[0,1,2])
Out[5]:
- Inspecting the fitted coefficients
In [6]:
clf.coef_
Out[6]:
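- The fitted model also exposes the intercept and can predict new points; a quick sketch on the same toy data (not in the original post):
print(clf.intercept_)         # ~0.0 for this toy fit
print(clf.predict([[3, 3]]))  # ~3.0, since the fitted line is y = 0.5*x1 + 0.5*x2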
1.2 Ridge Regression¶
In [7]:
from sklearn import linear_model
clf = linear_model.Ridge(alpha = 0.5)
clf.fit([[0,0],[0,0],[1,1]],[0,0.1,1])
Out[7]:
- Inspecting the coefficients
In [8]:
clf.coef_
Out[8]:
- Inspecting the intercept
In [9]:
clf.intercept_
Out[9]:
- Ridge regression with built-in cross-validation
In [10]:
from sklearn import linear_model
clf = linear_model.RidgeCV(alphas = [0.1, 0.5, 1])
clf.fit([[0,0],[0,0],[1,1]],[0,.1,1])
Out[10]:
- Inspecting the $\alpha$ chosen by cross-validation
In [11]:
clf.alpha_
Out[11]:
1.3 Lasso¶
In [12]:
clf = linear_model.Lasso(alpha = 0.1)
clf.fit([[0,0],[1,1]],[0,1])
Out[12]:
- Predicting for new data
In [13]:
clf.predict([[1,1]])
Out[13]:
In [15]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
###################################
# generate some sparse data to play with
np.random.seed(42)
n_sample, n_features = 50, 200
X = np.random.randn(n_sample, n_features)
coef = 3 * np.random.randn(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)
coef[inds[10:]] = 0  # sparsify: keep only 10 non-zero coefficients
y = np.dot(X, coef)
y
Out[15]:
In [17]:
## add noise to the response
y += 0.01 * np.random.normal(size = n_sample)
## split the data into train and test halves
X_train, y_train = X[:n_sample // 2], y[:n_sample // 2]
X_test, y_test = X[n_sample // 2:], y[n_sample // 2:]
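- As an aside (not in the original post): newer scikit-learn versions provide a helper that performs a shuffled train/test split in one call; a sketch assuming sklearn >= 0.18:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 42)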
- Lasso for this data
In [18]:
#####Lasso
from sklearn.linear_model import Lasso
alpha = 0.1
lasso = Lasso(alpha = alpha)
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
r2_pred_lasso = r2_score(y_test,y_pred_lasso)
print(lasso)
In [20]:
print("r^2 on test data: %f" % r2_pred_lasso)
- Elastic Net for this data
In [23]:
from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha = alpha, l1_ratio = 0.7)
ypred_enet = enet.fit(X_train, y_train).predict(X_test)
r2_score_enet = r2_score(y_test, ypred_enet)
print(enet)
In [25]:
print("r^2 on test data : %f" % r2_score_enet)
In [27]:
%pylab inline
plt.plot(enet.coef_, label = 'Elastic net coefficients')
plt.plot(lasso.coef_, label = 'Lasso coefficients')
plt.plot(coef, "--", label = 'original coefficients')
plt.legend(loc = "best")
plt.title("Lasso R^2: %f, Elastic Net R^2: %f"
          % (r2_pred_lasso, r2_score_enet))
plt.show()
1.5 Multi-task Lasso¶
Compared with the ordinary lasso, the multi-task lasso takes a two-dimensional response y of shape (n_samples, n_tasks), which should be viewed as panel data:
n_samples is the number of observations
n_tasks is the number of tasks; for example, given a month of 30 daily observations for 10 stocks, each stock's series is one task and they are regressed jointly
the multi-task lasso tries to find features that are useful for all tasks at once
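- As a minimal illustration before the full example (not in the original post), note that fit takes a two-dimensional Y and coef_ comes back with one row per task:
from sklearn.linear_model import MultiTaskLasso
clf = MultiTaskLasso(alpha = 0.1)
# toy data: 3 samples, 2 features, 2 tasks
clf.fit([[0, 0], [1, 1], [2, 2]], [[0, 0], [1, 1], [2, 2]])
print(clf.coef_)  # shape (n_tasks, n_features)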
In [6]:
%pylab inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import MultiTaskLasso, Lasso
rng = np.random.RandomState(42)
## randomly build a 2D coefficient array with a shared sparsity pattern
n_samples, n_features, n_tasks = 100, 30, 40
n_relevant_features = 5
coef = np.zeros((n_tasks, n_features))
times = np.linspace(0, 2 * np.pi, n_tasks)
for k in range(n_relevant_features):
    coef[:, k] = np.sin((1. + rng.randn(1)) * times + 3 * rng.randn(1))
X = rng.randn(n_samples, n_features)
Y = np.dot(X, coef.T) + rng.randn(n_samples, n_tasks)
coef_lasso = np.array([Lasso(alpha = 0.5).fit(X, y).coef_ for y in Y.T])
coef_multi_task_lasso_ = MultiTaskLasso(alpha = 1.).fit(X, Y).coef_
fig = plt.figure(figsize=(8, 5))
plt.subplot(1, 2, 1)
plt.spy(coef_lasso)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'Lasso')
plt.subplot(1, 2, 2)
plt.spy(coef_multi_task_lasso_)
plt.xlabel('Feature')
plt.ylabel('Time (or Task)')
plt.text(10, 5, 'MultiTaskLasso')
fig.suptitle('Coefficient non-zero location')
feature_to_plot = 0
plt.figure()
plt.plot(coef[:, feature_to_plot], 'k', label='Ground truth')
plt.plot(coef_lasso[:, feature_to_plot], 'g', label='Lasso')
plt.plot(coef_multi_task_lasso_[:, feature_to_plot],
'r', label='MultiTaskLasso')
plt.legend(loc='upper center')
plt.axis('tight')
plt.ylim([-1.1, 1.1])
plt.show()
1.7 LARS Lasso¶
In [13]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import datasets
diabetes = datasets.load_diabetes()
X = diabetes.data
Y = diabetes.target
print("Computing regularization path using the LARS …")
alphas, _, coefs = linear_model.lars_path(X, Y, method = 'lasso',verbose = True)
xx = np.sum(np.abs(coefs.T), axis = 1)
xx /= xx[-1]
plt.plot(xx, coefs.T)
ymin, ymax = plt.ylim()
plt.vlines(xx, ymin, ymax, linestyle='dashed')
plt.xlabel('|coef| / max|coef|')
plt.ylabel('Coefficients')
plt.title('LASSO Path')
plt.axis('tight')
plt.show()
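- The lasso-via-LARS model is also available as an estimator, LassoLars; a minimal fit sketch (not in the original post):
from sklearn.linear_model import LassoLars
clf = LassoLars(alpha = 0.1)
clf.fit([[0, 0], [1, 1]], [0, 1])
print(clf.coef_)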
1.8 Orthogonal Matching Pursuit (OMP)¶
- can be viewed as an implementation of L0 regularization
- the L0 penalty counts the number of non-zero coefficients, so it directly caps how many features enter the model
- equivalent to
$$\arg\min_\gamma \|\gamma\|_0 \quad \text{subject to} \quad \|y - X\gamma\|_2^2 \leq \mathrm{tol}$$
- functions used: OrthogonalMatchingPursuit and orthogonal_mp
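Neither function is demonstrated in the post, so here is a minimal sketch of OrthogonalMatchingPursuit recovering a sparse signal; the toy data and the n_nonzero_coefs choice are assumptions for illustration:
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit
rng = np.random.RandomState(0)
n_samples, n_features, n_nonzero = 100, 30, 5
X = rng.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:n_nonzero] = rng.randn(n_nonzero)  # sparse ground truth: 5 non-zero entries (illustrative)
y = np.dot(X, true_coef)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs = n_nonzero)
omp.fit(X, y)
print(np.nonzero(omp.coef_)[0])  # indices of the recovered non-zero coefficients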
1.9 Bayesian Regression¶
In [14]:
from sklearn import linear_model
X = [[0,0],[1,1],[2,2],[3,3]]
Y = [0,1,2,3]
clf = linear_model.BayesianRidge()
clf.fit(X,Y)
Out[14]:
In [15]:
clf.predict([[1,0]])
Out[15]:
In [16]:
clf.coef_
Out[16]:
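- BayesianRidge also estimates precision hyperparameters from the data; a quick sketch of inspecting them (attribute names as in sklearn's BayesianRidge):
print(clf.alpha_)   # estimated precision of the noise
print(clf.lambda_)  # estimated precision of the weights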
1.10 Polynomial Regression¶
In [17]:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.arange(6).reshape(3,2)
X
Out[17]:
In [18]:
poly = PolynomialFeatures(degree = 2)
poly.fit_transform(X)
Out[18]:
- A tool like Pipeline can be used to combine polynomial feature generation with linear regression
In [23]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
model = Pipeline([('poly',PolynomialFeatures(degree = 3)),
('linear',LinearRegression(fit_intercept=False))])
x = np.arange(5)
y = 3 - 2 * x + x ** 2 - x ** 3
model = model.fit(x[:, np.newaxis], y)
model.named_steps['linear'].coef_
Out[23]:
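Since y was generated exactly as $3 - 2x + x^2 - x^3$, and the degree-3 pipeline (with fit_intercept=False and the bias column from PolynomialFeatures) has just enough terms to interpolate it, the printed coefficients should recover [3, -2, 1, -1].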