Scikit-learn has a clean and consistent API that makes it easy to use: a similar workflow applies to most techniques. Let’s go over two examples.
This code was modified from Matthew Greenberg.
Load packages
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, accuracy_score
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
from collections import Counter
Example 1: California housing dataset
Load and explore the data
cal_housing = fetch_california_housing()
type(cal_housing)
Let’s look at the attributes of cal_housing:
cal_housing.feature_names
X = cal_housing.data
y = cal_housing.target
This can also be obtained with X, y = fetch_california_housing(return_X_y=True).
Let’s have a look at the shape of X and y:
X.shape
y.shape
While not necessary, we can turn this Bunch object into a more familiar data frame to explore the data further:
cal_housing_df = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
cal_housing_df.describe()
We can even plot it:
Create and fit a model
Let’s start with a very simple model: linear regression.
model = LinearRegression().fit(X, y)
This is equivalent to:
model = LinearRegression()
model.fit(X, y)
First, we create an instance of the class LinearRegression, then we call .fit() on it to fit the model.
Trailing underscores indicate that an attribute is estimated from the data. Here, .coef_ is an estimated value.
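As a small illustration of this naming convention, the sketch below fits a linear regression on synthetic data (the data and the names X_demo, y_demo, and demo_model are assumptions made for this example, not part of the housing dataset) and checks whether .coef_ exists before and after calling .fit():

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (assumption for illustration): y = 2*x0 - 3*x1 + 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 1

demo_model = LinearRegression()
# Estimated attributes do not exist on an unfitted estimator
has_coef_before = hasattr(demo_model, "coef_")

demo_model.fit(X_demo, y_demo)
# .fit() creates the attributes with a trailing underscore
has_coef_after = hasattr(demo_model, "coef_")

print(has_coef_before, has_coef_after)
print(demo_model.coef_, demo_model.intercept_)
```

Constructor arguments such as fit_intercept carry no trailing underscore because they are set by the user rather than estimated.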
We can now get our predictions:
y_hat = model.predict(X)
And calculate some measures of error:
mean_squared_error(y, y_hat)
MSE could also be calculated with np.mean((y - y_hat)**2).
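To make the error measures used in this section concrete, here is a minimal sketch that computes the mean squared error and the mean absolute percentage error by hand with NumPy. The arrays y_true and y_pred are made-up values chosen for easy arithmetic, not results from the housing model:

```python
import numpy as np

# Made-up targets and predictions (assumption for illustration)
y_true = np.array([2.0, 4.0, 5.0, 10.0])
y_pred = np.array([2.5, 3.0, 5.0, 8.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute percentage error: average of |residual| / |target|,
# expressed as a fraction (0.175 means 17.5 %)
mape = np.mean(np.abs((y_true - y_pred) / y_true))

print(mse, mape)
```

Note that scikit-learn's mean_absolute_percentage_error also returns a fraction rather than a value scaled by 100.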
mean_absolute_percentage_error(y, y_hat)
Index of minimum value:
Index of maximum value:
# Reproduce the fit by ordinary least squares:
# prepend a column of ones so that the first parameter is the intercept
XX = np.concatenate([np.ones((len(X), 1)), X], axis=1)
beta = np.linalg.lstsq(XX, y, rcond=None)[0]
intercept_, *coef_ = beta
# Compare our intercept with the one estimated by scikit-learn
intercept_, model.intercept_
np.allclose(coef_, model.coef_)
This means that the two arrays are equal element-wise, within a certain tolerance.
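The same least-squares check can be tried on data where the answer is known in advance. The sketch below is an illustration under assumed values (true_coef, true_intercept, and the synthetic X_demo are invented for the example): it builds noise-free targets, solves the least-squares problem with a column of ones, and uses np.allclose to compare the result with the known parameters.

```python
import numpy as np

# Synthetic, noise-free linear data with known parameters (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([1.5, -2.0, 0.5])
true_intercept = 4.0
y_demo = X_demo @ true_coef + true_intercept

# Prepend a column of ones so the first fitted parameter is the intercept
XX_demo = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(XX_demo, y_demo, rcond=None)

# Floating-point round-off can make exact equality fail,
# so compare within a tolerance instead
close_enough = np.allclose(beta[1:], true_coef)
print(beta[0], beta[1:], close_enough)
```

By default, np.allclose uses a relative tolerance of 1e-05 and an absolute tolerance of 1e-08.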
X_test = np.random.normal(size=(10, X.shape[1]))
X_test.shape
y_test = X_test @ coef_ + intercept_
y_test
Of course, instead of LinearRegression(), we could have used another model, for instance a random forest regressor (a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting):
# y_rf holds the predictions, not the fitted model
y_rf = RandomForestRegressor().fit(X, y).predict(X_test)
y_rf
Which is equivalent (up to the randomness of the forest) to:
model = RandomForestRegressor()
y_rf = model.fit(X, y).predict(X_test)
Example 2: breast cancer
Load and explore the data
b_cancer = load_breast_cancer()
Let’s print the description of this dataset:
print(b_cancer.DESCR)
X = b_cancer.data
y = b_cancer.target
Here again, we could have used instead X, y = load_breast_cancer(return_X_y=True).
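As a quick check of that equivalence, the following sketch loads the dataset both ways and compares the results. The names bunch, X_direct, and y_direct are chosen for this example only:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Route 1: load the Bunch object and pull out its fields
bunch = load_breast_cancer()

# Route 2: ask directly for the (data, target) pair
X_direct, y_direct = load_breast_cancer(return_X_y=True)

same_data = np.array_equal(bunch.data, X_direct)
same_target = np.array_equal(bunch.target, y_direct)
print(X_direct.shape, y_direct.shape, same_data, same_target)
```

The Bunch route is handy when the description or the feature names are also needed; return_X_y=True is shorter when only the arrays matter.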
Create and fit a first model
model = LogisticRegression(max_iter=10000)
y_hat = model.fit(X, y).predict(X)
Get some measure of accuracy:
accuracy_score(y, y_hat)
This can also be obtained with:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-10, 10, 100)
plt.plot(x, sigmoid(x), lw=3)
plt.title("The Sigmoid Function $\\sigma(x)$")
y_pred = 1 * (sigmoid(X @ model.coef_.squeeze() + model.intercept_) > 0.5)
assert np.all(y_pred == model.predict(X))
np.allclose(
    model.predict_proba(X)[:, 1],
    sigmoid(X @ model.coef_.squeeze() + model.intercept_)
)
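The check above thresholds the sigmoid of a linear score at 0.5. Since the sigmoid equals 0.5 exactly at zero and is increasing, this is the same as asking whether the linear score is positive. A minimal sketch on made-up scores (the array z stands in for X @ coef + intercept and is an assumption of this example):

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real score to the interval (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Made-up linear scores: -5, -4, ..., 5
z = np.arange(-5, 6).astype(float)

# Class 1 when the predicted probability exceeds 0.5 ...
labels_from_prob = (sigmoid(z) > 0.5).astype(int)
# ... which is the same as the linear score being positive
labels_from_score = (z > 0).astype(int)

print(labels_from_prob)
print(labels_from_score)
```

This is why the decision boundary of a logistic regression is linear in the features.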
def make_spirals(k=20, s=1.0, n=2000):
    X = np.zeros((n, 2))
    y = np.round(np.random.uniform(size=n)).astype(int)
    r = np.random.uniform(size=n) * k * np.pi
    rr = r**0.5
    theta = rr + np.random.normal(loc=0, scale=s, size=n)
    theta[y == 1] = theta[y == 1] + np.pi
    X[:, 0] = rr * np.cos(theta)
    X[:, 1] = rr * np.sin(theta)
    return X, y

X, y = make_spirals()
cmap = matplotlib.colormaps["viridis"]
a = cmap(0)
a = [*a[:3], 0.3]
b = cmap(0.99)
b = [*b[:3], 0.3]
plt.figure(figsize=(7, 7))
ax = plt.gca()
ax.set_aspect("equal")
ax.plot(X[y == 0, 0], X[y == 0, 1], 'o', color=a, ms=8, label="$y=0$")
ax.plot(X[y == 1, 0], X[y == 1, 1], 'o', color=b, ms=8, label="$y=1$")
plt.title("Spirals")
plt.legend()
Create and fit a second model
Here, we use a logistic regression:
model = LogisticRegression()
y_hat = model.fit(X, y).predict(X)
accuracy_score(y, y_hat)
u = np.linspace(-8, 8, 100)
v = np.linspace(-8, 8, 100)
U, V = np.meshgrid(u, v)
UV = np.array([U.ravel(), V.ravel()]).T
U.shape, V.shape, UV.shape
np.ravel returns a contiguous flattened array.
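The grid construction is easier to follow on a tiny example. The sketch below uses a 3 by 2 grid instead of 100 by 100 and a simple stand-in for model.predict (summing the two coordinates, an assumption made purely for illustration) to show how the shapes move from grid to point list and back:

```python
import numpy as np

u = np.linspace(0, 1, 3)   # 3 values along the horizontal axis
v = np.linspace(0, 1, 2)   # 2 values along the vertical axis

# meshgrid returns two arrays of shape (len(v), len(u)) = (2, 3)
U, V = np.meshgrid(u, v)

# Flatten both grids and pair them up: one row per grid point, shape (6, 2)
UV = np.array([U.ravel(), V.ravel()]).T

# Stand-in for model.predict(UV): one value per point,
# reshaped back onto the grid for plotting
W = UV.sum(axis=1).reshape(U.shape)

print(U.shape, V.shape, UV.shape, W.shape)
```

Because ravel and reshape both use row-major order by default, W[i, j] corresponds to the point (U[i, j], V[i, j]).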
W = model.predict(UV).reshape(U.shape)
W.shape
Create and fit a third model
Let’s use a k-nearest neighbours classifier this time:
model = KNeighborsClassifier(n_neighbors=5)
y_hat = model.fit(X, y).predict(X)
accuracy_score(y, y_hat)
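To give an idea of what a k-nearest neighbours classifier does, here is a bare-bones NumPy sketch: each new point takes the majority label among its k closest training points. The helper knn_predict and the toy data are hypothetical and written for this illustration; this is not scikit-learn's implementation, which offers faster neighbour searches and more options.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Hypothetical helper: majority vote among the k nearest
    training points, using Euclidean distance."""
    # Squared distances between every new point and every training point,
    # shape (n_new, n_train)
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points for each new point
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Labels of those neighbours, shape (n_new, k)
    votes = y_train[nearest]
    # Most frequent label in each row
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy data: two well-separated clusters (assumption for illustration)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.5, 0.5], [5.5, 5.5]])

pred = knn_predict(X_train, y_train, X_new, k=3)
# With k=1, each training point is its own nearest neighbour
train_pred = knn_predict(X_train, y_train, X_train, k=1)
print(pred, train_pred)
```

The last line hints at why accuracy measured on the training data itself is typically perfect for k=1 when no two training points coincide: every point votes for its own label.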
u = np.linspace(-8, 8, 100)
v = np.linspace(-8, 8, 100)
U, V = np.meshgrid(u, v)
UV = np.array([U.ravel(), V.ravel()]).T
U.shape, V.shape, UV.shape
W = model.predict(UV).reshape(U.shape)
W.shape
We can iterate over various values of k to see how the accuracy and pseudocolor plot evolve:
fig, axes = plt.subplots(2, 4, figsize=(9.8, 5))
fig.suptitle("Decision Regions")
u = np.linspace(-8, 8, 100)
v = np.linspace(-8, 8, 100)
U, V = np.meshgrid(u, v)
UV = np.array([U.ravel(), V.ravel()]).T
ks = np.arange(1, 16, 2)
for k, ax in zip(ks, axes.ravel()):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X, y)
    acc = accuracy_score(y, model.predict(X))
    W = model.predict(UV).reshape(U.shape)
    ax.imshow(W, origin="lower", cmap=cmap)
    ax.set_axis_off()
    ax.set_title(f"$k$={k}, acc={acc:.2f}")