Sklearn workflow

Author

Marie-Hélène Burle

Scikit-learn has a very clean and consistent API, making it very easy to use: a similar workflow can be applied to most techniques. Let’s go over two examples.

This code was modified from Matthew Greenberg.

Load packages

from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, accuracy_score

import pandas as pd

import matplotlib
from matplotlib import pyplot as plt

import numpy as np

from collections import Counter
ModuleNotFoundError: No module named 'sklearn'

Example 1: California housing dataset

Load and explore the data

cal_housing = fetch_california_housing()
type(cal_housing)
NameError: name 'fetch_california_housing' is not defined

Let’s look at the attributes of cal_housing:

dir(cal_housing)
NameError: name 'cal_housing' is not defined
cal_housing.feature_names
NameError: name 'cal_housing' is not defined
print(cal_housing.DESCR)
NameError: name 'cal_housing' is not defined
X = cal_housing.data
y = cal_housing.target
NameError: name 'cal_housing' is not defined

This can also be obtained with X, y = fetch_california_housing(return_X_y=True).

Let’s have a look at the shape of X and y:

X.shape
NameError: name 'X' is not defined
y.shape
NameError: name 'y' is not defined

While not at all necessary, we can turn this bunch object into a more familiar data frame to explore the data further:

cal_housing_df = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
NameError: name 'pd' is not defined
cal_housing_df.head()
NameError: name 'cal_housing_df' is not defined
cal_housing_df.tail()
NameError: name 'cal_housing_df' is not defined
cal_housing_df.info()
NameError: name 'cal_housing_df' is not defined
cal_housing_df.describe() 
NameError: name 'cal_housing_df' is not defined

We can even plot it:

plt.hist(y)
NameError: name 'plt' is not defined

Create and fit a model

Let’s start with a very simple model: linear regression.

model = LinearRegression().fit(X, y)
NameError: name 'LinearRegression' is not defined

This is equivalent to:

model = LinearRegression()
model.fit(X, y)

First, we create an instance of the class LinearRegression, then we call .fit() on it to fit the model.

model.coef_
NameError: name 'model' is not defined

Trailing underscores indicate that an attribute is estimated. .coef_ here is an estimated value.

model.coef_.shape
NameError: name 'model' is not defined
model.intercept_
NameError: name 'model' is not defined

We can now get our predictions:

y_hat = model.predict(X)
NameError: name 'model' is not defined

And calculate some measures of error:

  • Sum of squared errors
np.sum((y - y_hat) ** 2)
NameError: name 'np' is not defined
  • Mean squared error
mean_squared_error(y, y_hat)
NameError: name 'mean_squared_error' is not defined

MSE could also be calculated with np.mean((y - y_hat)**2).

mean_absolute_percentage_error(y, y_hat)
NameError: name 'mean_absolute_percentage_error' is not defined

Index of minimum value:

model.coef_.argmin()
NameError: name 'model' is not defined

Index of maximum value:

model.coef_.argmax()
NameError: name 'model' is not defined
XX = np.concatenate([np.ones((len(X), 1)), X], axis=1)

beta = np.linalg.lstsq(XX, y, rcond=None)[0]
intercept_, *coef_ = beta

intercept_, model.intercept_
NameError: name 'np' is not defined
np.allclose(coef_, model.coef_)
NameError: name 'np' is not defined

This means that the two arrays are equal element-wise, within a certain tolerance.

X_test = np.random.normal(size=(10, X.shape[1]))
X_test.shape
NameError: name 'np' is not defined
y_test = X_test @ coef_ + intercept_
y_test
NameError: name 'X_test' is not defined
model.predict(X_test)
NameError: name 'model' is not defined

Of course, instead of LinearRegression(), we could have used another model such as a random forest regressor (a meta estimator that fits a number of classifying decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting) for instance:

model = RandomForestRegressor().fit(X, y).predict(X_test)
model
NameError: name 'RandomForestRegressor' is not defined

Which is equivalent to:

model = RandomForestRegressor()
model.fit(X, y).predict(X_test)

Example 2: breast cancer

Load and explore the data

b_cancer = load_breast_cancer()
NameError: name 'load_breast_cancer' is not defined

Let’s print the description of this dataset:

print(b_cancer.DESCR)
NameError: name 'b_cancer' is not defined
b_cancer.feature_names
NameError: name 'b_cancer' is not defined
b_cancer.target_names
NameError: name 'b_cancer' is not defined
X = b_cancer.data
y = b_cancer.target
NameError: name 'b_cancer' is not defined

Here again, we could have used instead X, y = load_breast_cancer(return_X_y=True).

X.shape
NameError: name 'X' is not defined
y.shape
NameError: name 'y' is not defined
set(y)
NameError: name 'y' is not defined
Counter(y)
NameError: name 'Counter' is not defined

Create and fit a first model

model = LogisticRegression(max_iter=10000)
y_hat = model.fit(X, y).predict(X)
NameError: name 'LogisticRegression' is not defined

Get some measure of accuracy:

accuracy_score(y, y_hat)
NameError: name 'accuracy_score' is not defined

This can also be obtained with:

np.mean(y_hat == y)
def sigmoid(x):
  return 1/(1 + np.exp(-x))

x = np.linspace(-10, 10, 100)
plt.plot(x, sigmoid(x), lw=3)
plt.title("The Sigmoid Function $\\sigma(x)$")
NameError: name 'np' is not defined
y_pred = 1*(sigmoid(X @ model.coef_.squeeze() + model.intercept_) > 0.5)
assert np.all(y_pred == model.predict(X))

np.allclose(
    model.predict_proba(X)[:, 1],
    sigmoid(X @ model.coef_.squeeze() + model.intercept_)
)
NameError: name 'X' is not defined
def make_spirals(k=20, s=1.0, n=2000):
    X = np.zeros((n, 2))
    y = np.round(np.random.uniform(size=n)).astype(int)
    r = np.random.uniform(size=n)*k*np.pi
    rr = r**0.5
    theta = rr + np.random.normal(loc=0, scale=s, size=n)
    theta[y == 1] = theta[y == 1] + np.pi
    X[:,0] = rr*np.cos(theta)
    X[:,1] = rr*np.sin(theta)
    return X, y

X, y = make_spirals()
cmap = matplotlib.colormaps["viridis"]

a = cmap(0)
a = [*a[:3], 0.3]
b = cmap(0.99)
b = [*b[:3], 0.3]

plt.figure(figsize=(7,7))
ax = plt.gca()
ax.set_aspect("equal")
ax.plot(X[y == 0, 0], X[y == 0, 1], 'o', color=a, ms=8, label="$y=0$")
ax.plot(X[y == 1, 0], X[y == 1, 1], 'o', color=b, ms=8, label="$y=1$")
plt.title("Spirals")
plt.legend()
NameError: name 'np' is not defined

Create and fit a second model

Here, we use a logistic regression:

model = LogisticRegression()
y_hat = model.fit(X, y).predict(X)
accuracy_score(y, y_hat)
NameError: name 'LogisticRegression' is not defined
u = np.linspace(-8, 8, 100)
v = np.linspace(-8, 8, 100)
U, V = np.meshgrid(u, v)
UV = np.array([U.ravel(), V.ravel()]).T
U.shape, V.shape, UV.shape
NameError: name 'np' is not defined

np.ravel returns a contiguous flattened array.

W = model.predict(UV).reshape(U.shape)
W.shape
NameError: name 'model' is not defined
plt.pcolormesh(U, V, W)
NameError: name 'plt' is not defined

Create and fit a third model

Let’s use a k-nearest neighbours classifier this time:

model = KNeighborsClassifier(n_neighbors=5)
y_hat = model.fit(X, y).predict(X)
accuracy_score(y, y_hat)
NameError: name 'KNeighborsClassifier' is not defined
u = np.linspace(-8, 8, 100)
v = np.linspace(-8, 8, 100)
U, V = np.meshgrid(u, v)
UV = np.array([U.ravel(), V.ravel()]).T
U.shape, V.shape, UV.shape
NameError: name 'np' is not defined
W = model.predict(UV).reshape(U.shape)
W.shape
NameError: name 'model' is not defined
plt.pcolormesh(U, V, W)
NameError: name 'plt' is not defined

We can iterate over various values of k to see how the accuracy and pseudocolor plot evolve:

fig, axes = plt.subplots(2, 4, figsize=(9.8, 5))
fig.suptitle("Decision Regions")

u = np.linspace(-8, 8, 100)
v = np.linspace(-8, 8, 100)
U, V = np.meshgrid(u, v)
UV = np.array([U.ravel(), V.ravel()]).T

ks = np.arange(1, 16, 2)

for k, ax in zip(ks, axes.ravel()):
  model = KNeighborsClassifier(n_neighbors=k)
  model.fit(X, y)
  acc = accuracy_score(y, model.predict(X))
  W = model.predict(UV).reshape(U.shape)
  ax.imshow(W, origin="lower", cmap=cmap)
  ax.set_axis_off()
  ax.set_title(f"$k$={k}, acc={acc:.2f}")
NameError: name 'plt' is not defined