
I’m going through the Udemy course Complete Machine Learning and Data Science: Zero to Mastery and writing down my observations/lecture notes.

This is the fifth part of the blog post series.

9. Scikit-Learn

Up until now, we’ve learned how to consume data and make fancy diagrams.

The current section finally deals with Machine Learning and teaches you the basics of Scikit-learn.

Scikit-Learn Workflow

  1. Get data ready
  2. Pick a model
  3. Fit the model to the data and make predictions
  4. Evaluate the model
  5. Improve through experimentation
  6. Save and reload your trained model

Example code (check the course’s Jupyter Notebook for a better display):

## 1. Get the data
import pandas as pd
import numpy as np
heart_disease = pd.read_csv("../data/heart-disease.csv")

## Create X (Features Matrix)
X = heart_disease.drop("target", axis=1)

## Create y (labels)
y = heart_disease["target"]

## 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

## keep the default hyperparameters
clf.get_params()

## 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train);

## make a prediction
y_preds = clf.predict(X_test)

## 4. Evaluate the model on training data and test data
clf.score(X_train, y_train)
clf.score(X_test, y_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

## 5. Improve a model
## Clumsy example, there are better built-in methods
## Try different values for n_estimators
np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators.")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

## 6. Save the model and load it
import pickle

pickle.dump(clf, open("../data/random_forest_model_1.pkl", "wb"))
loaded_model = pickle.load(open("../data/random_forest_model_1.pkl", "rb"))
loaded_model.score(X_test, y_test)

Notes

Feature engineering: preparing the data for the machine-learning model, e.g. dealing with missing values and encoding non-numerical data.
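
A minimal sketch of both steps, using pandas and scikit-learn on a hypothetical toy DataFrame (the column names and values are made up for illustration, not taken from the heart-disease dataset):

import pandas as pd
from sklearn.impute import SimpleImputer

## hypothetical toy data with a missing value and a non-numerical column
df = pd.DataFrame({
    "age": [51, 42, None, 60],
    "chest_pain": ["typical", "atypical", "typical", "non-anginal"],
})

## fill missing numerical values with the column mean
imputer = SimpleImputer(strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])

## one-hot encode the non-numerical column
df = pd.get_dummies(df, columns=["chest_pain"])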

Features vs Target:
Think of the Pandas DataFrame representation of a dataset as a two-dimensional matrix.
Each row is a sample. Each column contains a piece of information. The columns are called features.

The features matrix is often named X.

We also have a label or target vector. That’s the output we want to predict with machine learning.

The distinguishing feature of the target array is that it is usually the quantity we want to predict from the data: in statistical terms, it is the dependent variable. 1

The convention is to use y.
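
Reusing X and y from the code above, the shapes make the distinction concrete:

## X is a 2-D matrix, y a 1-D vector
print(X.shape)   ## (number of samples, number of features)
print(y.shape)   ## (number of samples,)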

How to choose the right estimator? see scikit-learn map.

Model validation:

  • holdout sets (where you split the data into a training set and a testing set)
  • cross-validation

Example cross validation code:

## the data split and trained model come from the code above

## imports
from sklearn.model_selection import cross_val_predict

## make cross-validated predictions
y_pred = cross_val_predict(clf, X, y, cv=6)
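
Cross-validation can also be used to score the model directly. A minimal sketch with cross_val_score, reusing clf, X and y from above:

from sklearn.model_selection import cross_val_score

## 5-fold cross-validated accuracy scores and their mean
cv_scores = cross_val_score(clf, X, y, cv=5)
print(cv_scores)
print(cv_scores.mean())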

Metrics:

  classification    regression
  accuracy          R^2
  precision         mean absolute error (MAE)
  recall            mean squared error (MSE)
  f1                root mean squared error (RMSE)

Classification Metrics:

  • precision: proportion of positive identifications that were actually correct
  • recall: proportion of actual positives that were correctly identified
  • F1: combination (harmonic mean) of precision and recall (perfect model = 1.0)
  • support: number of samples each metric was calculated on
  • accuracy: proportion of correct predictions, in decimal form
  • macro avg: average precision, recall and F1 score across classes (doesn’t take class imbalance into account)
  • weighted avg: like macro avg, but each class is weighted by its support (number of samples)
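
These metrics can also be computed individually. A minimal sketch, reusing y_test and y_preds from the code above:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## individual classification metrics for the predictions made above
print(accuracy_score(y_test, y_preds))
print(precision_score(y_test, y_preds))
print(recall_score(y_test, y_preds))
print(f1_score(y_test, y_preds))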

Regression Metrics:

  • R^2: compares your model’s predictions to the mean of the targets - ideal value = 1
  • Mean absolute error (MAE): average of the absolute differences between predictions and actual values
  • Mean squared error (MSE): average of the squared differences between predictions and actual values (removes negative errors and amplifies outliers, i.e. samples with larger errors)
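
A minimal sketch of how these are computed in scikit-learn, using hypothetical true values and predictions (the numbers are made up for illustration):

import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

## hypothetical true values and predictions for a regression problem
y_true = np.array([3.0, 2.5, 4.0, 5.1])
y_pred = np.array([2.8, 2.7, 4.2, 4.9])

print(r2_score(y_true, y_pred))
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(np.sqrt(mean_squared_error(y_true, y_pred)))  ## RMSE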

Improving a model:

  • first predictions = baseline predictions
  • first model = baseline model
  • improving from a data perspective:
    • could we collect more data?
    • could we improve our data?
  • improving from a model perspective:
    • is there a better model we could use?
    • could we improve the current model?

Parameters: the patterns a model finds in the data on its own
Hyperparameters: settings on the model that you can adjust to improve its performance
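
The loop in step 5 above is the clumsy way; scikit-learn can do the hyperparameter search for you with GridSearchCV. A minimal sketch, reusing the training and test data from above (the parameter grid is just an example):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

## example grid of hyperparameter values (chosen for illustration only)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [None, 5, 10]}

grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))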

Use pipelines to build the final workflow in a concise way.
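
A minimal sketch of what such a pipeline could look like; the preprocessing steps are illustrative, not the course’s exact pipeline:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

## example pipeline: impute missing values, scale features, then classify
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier()),
])

pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))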


Thoughts

The section covers an enormous topic: supervised learning using a sophisticated Python library.

Daniel conveys the foundations and practical application in straightforward terms.
You have to type a lot of code. For me, the pace was too slow at times.

I would have liked to see more in-depth explanations of the different machine learning models. I appreciate that the Udemy class is a practical course, but getting the idea behind the models with some simplified math would have been helpful.

Without looking up linear regression and Bayes classification on my own, the use of complicated machine learning models would have felt too abstract.
The lectures jumped directly to complicated models like Random Forest without covering the intuition behind the simpler models first.

Applying a machine learning model without understanding why the model could be a good fit for the problem is not helpful for me.

Still, the class covers the essential techniques of working with scikit-learn and supervised learning. The library is massive, but Daniel boils it down to the crucial parts.


Go to the other parts of the series: