Author: Marie-Hélène Burle

Scikit-learn has a clean and consistent API: nearly every estimator is created, fitted, and used for prediction in the same way, so a similar workflow can be applied to most techniques. Let’s go over two examples.

The code in this section is adapted from code by Matthew Greenberg.
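
The shared workflow mentioned above can be sketched as follows. This is an illustration only and is not taken from the adapted material: it fits two different classifiers using the same `fit` and `predict` calls. The dataset, the `max_iter` and `n_neighbors` values, and the choice to score on the training data are assumptions made to keep the example short; a real analysis would evaluate on held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Small dataset bundled with scikit-learn (no download needed)
X, y = load_breast_cancer(return_X_y=True)

# Two unrelated techniques, both exposed through the same estimator interface
# (hyperparameter values are illustrative assumptions)
models = {
    "logistic regression": LogisticRegression(max_iter=10000),
    "k-nearest neighbours": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X, y)              # 1. fit the model to the data
    y_pred = model.predict(X)    # 2. predict with the fitted model
    scores[name] = accuracy_score(y, y_pred)  # 3. compare predictions to labels
    print(f"{name}: training accuracy = {scores[name]:.3f}")
```

Swapping in another estimator only requires changing an entry in the dictionary; the calls to `fit` and `predict` stay the same.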

## Load packages

from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_percentage_error,
    accuracy_score
)

import pandas as pd

import matplotlib
from matplotlib import pyplot as plt

import numpy as np

from collections import Counter

## Example 1: California housing dataset

### Load and explore the data

cal_housing = fetch_california_housing()
type(cal_housing)
sklearn.utils._bunch.Bunch
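
The `Bunch` type reported above is a dictionary subclass whose keys can also be read as attributes. The sketch below shows this behaviour on a small hand-built `Bunch`; the keys and values are made up for illustration and are not the California housing data.

```python
from sklearn.utils import Bunch

# Hand-built Bunch with made-up contents, for illustration only
toy = Bunch(
    data=[[1.0, 2.0], [3.0, 4.0]],
    target=[0, 1],
    feature_names=["a", "b"],
)

# Each key is available both as an attribute and as a dictionary entry
same_object = toy.data is toy["data"]
print(same_object)

# dir() lists the keys, which makes it a convenient way to explore a dataset
keys = dir(toy)
print(keys)

# A Bunch is a dict subclass, so the usual dict methods work too
is_dict = isinstance(toy, dict)
print(is_dict)
```

The same two access styles apply to the objects returned by the scikit-learn dataset loaders.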

Let’s look at the attributes of cal_housing:

dir(cal_housing)
['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
cal_housing.feature_names
['MedInc',
 'HouseAge',
 'AveRooms',
 'AveBedrms',
 'Population',
 'AveOccup',
 'Latitude',
 'Longitude']
print(cal_housing.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
- MedInc        median income in block group
- HouseAge      median house age in block group
- AveRooms      average number of rooms per household
- AveBedrms     average number of bedrooms per household
- Population    block group population
- AveOccup      average number of household members
- Latitude      block group latitude
- Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,