Today I learned how to reduce the number of features in a data set with Principal Component Analysis (PCA).

From Python Data Science Handbook:

Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data, […]

You can use PCA to learn about the relationship between two values:

In principal component analysis, this relationship is quantified by finding a list of the principal axes in the data, and using those axes to describe the dataset.
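
To make "finding the principal axes" a bit more concrete, here is a minimal sketch in plain NumPy. It assumes a made-up 2D toy data set (not the diabetes data) and uses the eigenvectors of the covariance matrix as the principal axes. scikit-learn's PCA should give the same axes up to sign, but this is only an illustration, not how the library is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 2D with correlated coordinates
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.3, size=200)
X = np.column_stack([x, y])

# Center the data, then compute the covariance matrix of the features
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal axes;
# eigenvalues are the variance of the data along each axis
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder them to put
# the direction of largest variance first
order = np.argsort(eigenvalues)[::-1]
explained_variance = eigenvalues[order]
principal_axes = eigenvectors[:, order].T  # one axis per row

print(principal_axes)
print(explained_variance)
```

The first row of `principal_axes` is the direction along which the toy data varies the most. The second row is perpendicular to it.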

Let’s assume we have a pandas DataFrame called diabetes_df with 10 different columns (features).
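
The post doesn't say where diabetes_df comes from. As an assumption, one way to get such a DataFrame is scikit-learn's bundled diabetes data set, which happens to have 10 feature columns:

```python
from sklearn.datasets import load_diabetes

# Assumption: the data is scikit-learn's bundled diabetes data set.
# as_frame=True returns the features as a pandas DataFrame.
diabetes = load_diabetes(as_frame=True)
diabetes_df = diabetes.data  # the 10 feature columns

print(diabetes_df.shape)
print(list(diabetes_df.columns))
```

With this setup, `diabetes.target` holds the value to predict for each row, which is handy for coloring a plot later.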

We can use scikit-learn’s PCA estimator to reduce the number of features from 10 to 2. Then we can visualize the projected data points with matplotlib.

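# Reduce dimensionality with PCA
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# instantiate the model with 2 components
pca = PCA(n_components=2)

# project from 10 to 2 dimensions
project_diab = pca.fit_transform(diabetes_df)

# plot the projected points, colored by the target value
# (assumes `diabetes` is the data set object that diabetes_df was built
# from, e.g. the result of sklearn.datasets.load_diabetes())
plt.scatter(project_diab[:, 0], project_diab[:, 1],
            c=diabetes.target, edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('Spectral', 10))
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.colorbar()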
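
Going from 10 features to 2 throws information away, so it can be worth checking how much of the variance the two components keep. A small sketch, again assuming scikit-learn's bundled diabetes data set as a stand-in for diabetes_df:

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

# Assumption: stand-in data from scikit-learn's bundled diabetes data set
X = load_diabetes().data

pca = PCA(n_components=2)
projected = pca.fit_transform(X)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)

# Fraction captured by both components together
print(pca.explained_variance_ratio_.sum())
```

If the sum is low, the 2D scatter plot only shows part of the structure in the data, and more components may be needed for anything beyond a quick look.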

For a visual explanation of Principal Component Analysis, I can recommend this site: Principal Component Analysis Explained Visually.