Create a pipeline to score different machine learning models with scikit-learn

After the initial data exploration, I like to get a quick gauge of which model is likely to work best for the problem at hand.

A rough estimate helps narrow down which machine-learning models to use and tune later. It also gives a sense of how effective the prospective algorithms will be.

The goal is to get a big-picture overview, not a final result.

How to Write a Pipeline to Score Different Models

  1. Prep

I assume that you have a dataset with features (X) and target labels (y).

Import the models you want to score.
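
If you do not have a dataset at hand and just want to try the steps, a synthetic one is enough. This is only a sketch: make_classification and the sizes chosen here are my assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification

# Synthetic stand-in for a real dataset (an assumption for illustration):
# 200 samples, 10 numeric features, binary target labels.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=42)

# X is the feature matrix, y holds the target labels
print(X.shape, y.shape)
```

Any feature matrix and label vector of matching length can take the place of X and y in the steps that follow.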

  2. Create preprocessing pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion, Pipeline

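# Create feature union
# FeatureUnion runs both transformers on the same input and concatenates
# their outputs: the standardized features plus the TSVD components
tsvd_components = 2  # assumed value for illustration; must be smaller than the number of features in X
features = [('standardize', StandardScaler()),
            ('tsvd', TruncatedSVD(n_components=tsvd_components))]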
feature_union = FeatureUnion(features)
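
To see what a feature union produces, it can help to run it on a small matrix and check the output shape. The sketch below uses made-up random data and hypothetical names (X_demo, demo_union); it only illustrates that the transformer outputs are stacked side by side.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical data): 100 samples, 8 features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 8))

demo_union = FeatureUnion([('standardize', StandardScaler()),
                           ('tsvd', TruncatedSVD(n_components=2, random_state=0))])
X_out = demo_union.fit_transform(X_demo)

# 8 standardized columns + 2 TSVD components, concatenated column-wise
print(X_out.shape)
```

The result has 10 columns, so the downstream estimator sees both the scaled original features and the TSVD components.
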
  3. Create models pipeline

Now combine the feature_union with a scikit-learn estimator in a Pipeline, one pipeline for each model you want to score.


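from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Create pipelines
# Each one combines the feature union with a scikit-learn estimator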

# Logistic Regression
estimators_log_r = [('feature_union', feature_union),
                    ('logistic', LogisticRegression(random_state=42))]
model_log_r = Pipeline(estimators_log_r)

# SVC
estimators_svc = [('feature_union', feature_union),
                  ('svc', SVC(probability=True, random_state=42))]
model_svc = Pipeline(estimators_svc)

# Random Forest
estimators_rf = [('feature_union', feature_union),
                 ('rf', RandomForestClassifier(n_jobs=-1, random_state=42))]
model_rf = Pipeline(estimators_rf)

models = {'Logistic_Regression': model_log_r,
          'SVC': model_svc,
          'Random_Forest_C': model_rf}
  4. Score models
from sklearn.model_selection import cross_val_score

scores = {name: cross_val_score(model, X, y) for name, model in models.items()}

Now you have a dictionary that maps each model name to the array of validation scores returned by cross_val_score, one score per fold (five folds by default in scikit-learn 0.22 and later).
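
Per-fold arrays are hard to compare at a glance, so one option is to reduce each to a mean and a standard deviation and sort by the mean. The sketch below is self-contained and rests on assumptions of mine: synthetic data, bare estimators instead of the full pipelines, and explicit cv=5 and scoring='accuracy' arguments. The names demo_models, demo_scores and summary are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for X and y (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

demo_models = {
    'Logistic_Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random_Forest_C': RandomForestClassifier(n_estimators=50, random_state=42),
}

# One array of 5 accuracy scores per model
demo_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
               for name, model in demo_models.items()}

# Reduce each array to (mean, standard deviation)
summary = {name: (fold_scores.mean(), fold_scores.std())
           for name, fold_scores in demo_scores.items()}

# Print models from best to worst mean score
for name, (mean, std) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```

The same reduction should work on the scores dictionary built above. A large standard deviation relative to the gap between means suggests the ranking is not yet reliable.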

Further Reading