Hands-on Machine Learning
February 21, 2021
Chapter 2 - End-to-End Machine Learning Project
Work with real-world data when learning. Example open datasets:
- UC Irvine Machine Learning Repository
- Kaggle Datasets
- Amazon's AWS datasets
- Meta portals: OpenDataMonitor, Quandl
- Wikipedia's list of machine learning datasets
- Datasets subreddit
Example project uses the California Housing Prices dataset from StatLib.
Goal: use California census data to build a model of housing prices in the state.
The dataset has
- population
- median income
- median housing price for each “block group”
A block group has 600 to 3,000 people and will be called a “district” in the example.
The model's output (a prediction of a district's median housing price) will be fed into downstream systems to make investment decisions.
ML systems tend to use pipelines: decoupled processes that use data stores (databases) as sources/sinks.
Without a model of a district's median housing price, the current estimates are made manually and are often off by 20% or more.
This is a “supervised learning” task since we have labeled data, and it is a typical “regression” task. Specifically multiple regression, since the system will use multiple features to make a prediction.
It is also “univariate regression” since we are only trying to predict a single value for each district.
We will use the Root Mean Square Error (RMSE) as our performance measure.
RMSE is calculated by averaging the squared differences between the predictions and the actual values, then taking the square root of that average: RMSE = sqrt((1/m) * Σ (h(x_i) - y_i)^2).
If there are many outlier districts the “Mean Absolute Error” may be preferred.
MAE is the mean of the absolute differences between the actual values and the predictions: MAE = (1/m) * Σ |h(x_i) - y_i|.
MAE and RMSE are measuring the distance between prediction values and the target values.
- RMSE corresponds to the l2 norm (the “Euclidean norm”): the straight-line distance between two points, like the hypotenuse of a triangle.
- MAE corresponds to the l1 norm (the “Manhattan norm”): the distance between two points travelling only along orthogonal lines, like the base and height of a triangle.
- RMSE works best when errors are roughly normally distributed; MAE is preferable when the data contains many outliers.
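A minimal sketch of both metrics with scikit-learn (the y arrays below are made-up toy values, just to show the calls):

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    # toy actual vs. predicted district prices (made-up numbers)
    y_true = np.array([200_000, 310_000, 150_000, 425_000])
    y_pred = np.array([210_000, 290_000, 180_000, 400_000])

    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # l2: penalizes large errors more
    mae = mean_absolute_error(y_true, y_pred)            # l1: less sensitive to outliers
    print(rmse, mae)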
Think about how downstream pipelines are going to consume the data: as exact values or as categories? Output what they need.
Creating a test set is key. In this example we take 20% of the data and set it aside. Setting the test set aside early (and not peeking at it) helps you avoid data snooping bias.
A Jupyter notebook is used to work with the data.
If you generate your test set by purely random sampling you may miss important strata. If your data is stratified in some way, tools that sample randomly from each stratum give you a more representative test set than simple random sampling.
sklearn provides StratifiedShuffleSplit for this.
Getting good test data is key to checking your results and is an often overlooked step.
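Rough sketch of the stratified split (the income buckets and the tiny made-up DataFrame are placeholders; the real notebook uses the housing data loaded earlier):

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedShuffleSplit

    # stand-in for the housing DataFrame; only the column we stratify on matters here
    housing = pd.DataFrame({"median_income": np.random.lognormal(1.0, 0.5, 1000)})

    # bucket the continuous income into categories so we can stratify on it
    housing["income_cat"] = pd.cut(housing["median_income"],
                                   bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                   labels=[1, 2, 3, 4, 5])

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_idx, test_idx in split.split(housing, housing["income_cat"]):
        strat_train_set = housing.iloc[train_idx]
        strat_test_set = housing.iloc[test_idx]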
Scikit-Learn Design - page 64
Main design principles (all objects share simple, consistent interfaces):
- Estimators - estimation is performed by the fit() method, which takes a dataset (plus labels for supervised algorithms) as parameters.
- Transformers - transformation is performed by the transform() method, which takes a dataset and returns the transformed version, using the parameters learned by fit() (e.g. an imputer's learned medians).
- Predictors - predict() takes a dataset of new instances and returns predictions; score() measures the quality of the predictions given a test set (and labels).
- Inspection - all of an estimator's hyperparameters are accessible directly via public instance variables.
- Nonproliferation of classes - datasets are represented as NumPy arrays or SciPy sparse matrices rather than homemade classes.
- Composition - existing building blocks are reused as much as possible.
- Sensible defaults
A hyperparameter is just a regular Python string or number.
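A small sketch tying the estimator/transformer/predictor vocabulary to code (toy arrays, not the housing data):

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
    y = np.array([10.0, 20.0, 30.0])

    imputer = SimpleImputer(strategy="median")  # estimator + transformer
    X_clean = imputer.fit_transform(X)          # fit() learns the medians, transform() fills the NaNs
    print(imputer.strategy)                     # hyperparameter: a public instance variable
    print(imputer.statistics_)                  # learned parameters end with an underscore

    lin_reg = LinearRegression()                # predictor
    lin_reg.fit(X_clean, y)
    print(lin_reg.predict([[2.0, 4.0]]))        # predictions for new instances
    print(lin_reg.score(X_clean, y))            # quality of predictions (R^2 here)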
Feature scaling: ML algorithms don't perform well when the input numerical attributes have very different scales.
- min-max scaling - aka normalization. Values are shifted and rescaled so that they end up ranging from 0 to 1: subtract the min value and divide by (max - min). Provided by MinMaxScaler.
- standardization: subtract the mean value and divide by the standard deviation so the result has unit variance. Does not bound values to a specific range, but it is less affected by outliers. StandardScaler is provided for this.
- fit the scalers on the training data only (not the full dataset), then apply the same transformation to the test set.
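Sketch of both scalers on a toy one-column array with an outlier; the point is fit on train, transform both:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X_train = np.array([[1.0], [5.0], [10.0], [100.0]])  # toy feature with an outlier
    X_test = np.array([[3.0], [50.0]])

    min_max = MinMaxScaler()
    X_train_mm = min_max.fit_transform(X_train)  # learn min/max from the training data only
    X_test_mm = min_max.transform(X_test)        # reuse those stats on the test data

    std = StandardScaler()
    X_train_std = std.fit_transform(X_train)     # zero mean, unit variance; range is unbounded
    X_test_std = std.transform(X_test)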
….
Page 82: It's not always possible to determine model performance without any human analysis.
Your model will rot and you will need to update it.
Monitoring and updating a model is more work than creating it.
Treat your models like application assets: version them and be ready to roll back.
Chapter 3 Classification
Chapter 2 is about regression analysis, chapter 3 is on classification.
The MNIST dataset is 70,000 small images of digits handwritten by high school students and Census Bureau employees. It is a labeled set with each image labeled with the digit it represents.
The MNIST dataset is the “hello world” of ML datasets. People often test their new classification algorithms on it.
MNIST is already split into training and test sets: the first 60,000 images are training, the last 10,000 are test.
First we train a binary classifier to learn 5 vs not-5
We use the SGDClassifier - Stochastic Gradient Descent (SGD).
- Training instances are handled one at a time, which makes it well suited to online learning.
- SGD relies on randomness (hence “stochastic”); to get reproducible results, set random_state to a constant.
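Roughly how the setup looks in code (fetch_openml pulls MNIST from OpenML; the as_frame=False flag assumes a reasonably recent scikit-learn):

    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import SGDClassifier

    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X, y = mnist["data"], mnist["target"]          # 70,000 images, labels are strings "0".."9"
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    y_train_5 = (y_train == "5")                   # binary target: 5 vs. not-5
    y_test_5 = (y_test == "5")

    sgd_clf = SGDClassifier(random_state=42)       # fixed random_state for reproducibility
    sgd_clf.fit(X_train, y_train_5)
    sgd_clf.predict([X_train[0]])                  # predict on a single image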
When we check the cross-validation accuracy score it is high (around 93%), but this is misleading because only about 10% of the images are 5s: always predicting “not 5” would already score about 90%. This is why accuracy is not the preferred way to evaluate classifiers, especially with “skewed datasets” where some classes are much more frequent than others.
We are better off using a “confusion matrix” - count the number of times instances of one class are classified as another class.
The confusion matrix contains the true/false positives and negatives; a more concise summary is the “precision” of the classifier:
True Positives / ( True Positives + False Positives)
You can game precision by making a single (correct) positive prediction, so it's useful to also report the “recall” (aka “sensitivity” or “true positive rate”) of your classifier, which is
True Positives / ( True Positives + False Negatives )
Precision and recall are often combined into the F1 score, the “harmonic mean” of precision and recall. The harmonic mean gives more weight to low values than the simple mean does, so a classifier will only get a high F1 if both recall and precision are high.
F1 = 2 / ((1/precision) + (1/recall)) = 2 x (precision x recall) / (precision + recall)
   = True Positives / (True Positives + (False Negatives + False Positives) / 2)
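Sketch of computing all of these, assuming sgd_clf / X_train / y_train_5 from the earlier MNIST sketch:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # out-of-fold predictions so every training instance gets a "clean" prediction
    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

    print(confusion_matrix(y_train_5, y_train_pred))  # rows = actual class, columns = predicted class
    print(precision_score(y_train_5, y_train_pred))   # TP / (TP + FP)
    print(recall_score(y_train_5, y_train_pred))      # TP / (TP + FN)
    print(f1_score(y_train_5, y_train_pred))          # harmonic mean of precision and recall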
Whether you prefer precision or recall is largely situation dependent. For example, it's probably better to have high precision for an algorithm that flags child-safe videos (we discard nearly all bad videos at the cost of also discarding many acceptable ones), and high recall for identifying shoplifters (we catch nearly all shoplifters at the cost of investigating many innocents).
Precision/recall is a tradeoff: your false positives will increase as your false negatives decrease (and vice versa), depending on the decision threshold.
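One way to see the tradeoff is to sweep the decision threshold over raw scores (again assuming the earlier sgd_clf / X_train / y_train_5):

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve

    # decision scores instead of hard predictions, so the threshold can be varied
    y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                                 method="decision_function")
    precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
    # raising the threshold increases precision but lowers recall, and vice versa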
ROC Curve = Receiver Operating Characteristic Curve
- Similar to the precision/recall curve, but the ROC curve plots the true positive rate vs. the false positive rate (FPR); FPR = 1 - true negative rate (TNR).
- TNR is also known as the specificity.
- So the ROC curve plots sensitivity (recall) vs. 1 - specificity.
- Plotted with the roc_curve() function.
The integral of the ROC curve (the area under it) is one way to measure the quality of a classifier.
A perfect classifier will have an AUC (area under the curve) of 1.
A purely random classifier will have an AUC of 0.5.
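Sketch, reusing y_scores from the threshold example above:

    from sklearn.metrics import roc_curve, roc_auc_score

    fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)  # FPR vs. TPR at each threshold
    print(roc_auc_score(y_train_5, y_scores))               # 1.0 = perfect, 0.5 = random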
Next a RandomForestClassifier is tested; it performs better, as shown by a higher ROC AUC.
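A rough version of that comparison (random forests have no decision_function, so the positive-class probability is used as the score):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import roc_auc_score

    forest_clf = RandomForestClassifier(random_state=42)
    y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                        method="predict_proba")
    y_scores_forest = y_probas_forest[:, 1]                 # probability of the positive class
    print(roc_auc_score(y_train_5, y_scores_forest))        # compare against the SGD classifier's AUC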
Multiclass classification = distinguishing between more than two classes (i.e. more than pizza/not-pizza).
One way to create a multiclass classifier is to train N binary classifiers (one per class); for each instance, score it with all N classifiers and pick the class with the highest confidence. This is called one-versus-the-rest (OvR), or one-versus-all.
Another strategy is to train a binary classifier for every pair of classes, e.g. 0 vs. 1, 0 vs. 2, etc. This is one-versus-one (OvO). It requires training N x (N - 1) / 2 classifiers, which grows large fast: 45 for MNIST's 10 classes.
Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass task and automatically runs OvR or OvO.
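If you want to force a particular strategy instead of letting Scikit-Learn choose, the wrappers below do it explicitly (assumes X_train / y_train with the full 0-9 labels from the MNIST sketch; training OvO this way is slow, it is just to show the API):

    from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
    from sklearn.linear_model import SGDClassifier

    ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
    ovr_clf.fit(X_train, y_train)       # full 0-9 labels, not the 5/not-5 target
    print(len(ovr_clf.estimators_))     # 10 binary classifiers, one per class

    ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
    ovo_clf.fit(X_train, y_train)
    print(len(ovo_clf.estimators_))     # 10 x 9 / 2 = 45 binary classifiers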