
I’m going through the Udemy course Complete Machine Learning and Data Science: Zero to Mastery and writing down my observations/lecture notes.

This is the fourth part of the blog post series.

7. NumPy

The section covers an introduction to NumPy.

NumPy will convert any data into a series of numbers. NumPy is the backbone of all data science in Python.
Pandas and other machine-learning libraries are built on top of NumPy.
Machine Learning is finding patterns in NumPy arrays.

Behind the scenes, the library uses compiled C code. Thus it’s fast.

Find a useful overview of NumPy at A Visual Intro to NumPy and Data Representation.

Data Structures

  • main data structure: ndarray
  • 1-dimensional = array, vector (example shape = (3,) - three elements)
  • more than one dimension = array, matrix (example shapes = (2, 3) - 2 rows, 3 columns; or (2, 3, 3) for three dimensions)
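
To make the shapes concrete, here is a minimal sketch. The array names and values are my own and not taken from the course. Note that NumPy reports the shape of a 1-dimensional array as (3,), a tuple with a single entry.

```python
import numpy as np

# 1-dimensional array (vector) with three elements
a1 = np.array([1, 2, 3])

# 2-dimensional array (matrix): 2 rows, 3 columns
a2 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])

# 3-dimensional array: 18 values arranged as shape (2, 3, 3)
a3 = np.arange(18).reshape(2, 3, 3)

print(a1.shape, a1.ndim)  # (3,) 1
print(a2.shape, a2.ndim)  # (2, 3) 2
print(a3.shape, a3.ndim)  # (2, 3, 3) 3
```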

Some Useful Functions

  • view the shape of an ndarray with <ndarray>.shape
  • number of dimensions: <ndarray>.ndim
  • data type: <ndarray>.dtype
  • number of elements in array: <ndarray>.size
  • creating array, filled with ones: numpy.ones(shape, dtype, order)
  • creating array, filled with zeroes: numpy.zeros(shape, dtype, order)
  • create array with evenly spaced values within given interval: numpy.arange([start, ]stop, [step, ]dtype=None)
  • find unique values within passed values: numpy.unique(values)
  • use NumPy’s methods on NumPy data types, use Python’s methods on Python’s data types
  • reshape ndarrays with <ndarray>.reshape(<new shape>) or transpose them with <ndarray>.T - see Numpy reshape and transpose
  • sort arrays with np.sort(<ndarray>) (returns a sorted copy) or np.argsort(<ndarray>) (returns the indices that would sort the array)
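
The functions from the list can be tried out in a few lines. This is only a sketch with made-up sample values, and it follows the common convention of importing NumPy as np.

```python
import numpy as np

ones = np.ones((2, 3))                # 2x3 array filled with 1.0
zeros = np.zeros((2, 3), dtype=int)   # 2x3 array filled with integer 0
evens = np.arange(0, 10, 2)           # [0 2 4 6 8]

values = np.array([3, 1, 2, 3, 1])
unique_values = np.unique(values)     # [1 2 3]

grid = np.arange(6).reshape(2, 3)     # [[0 1 2] [3 4 5]]
transposed = grid.T                   # rows and columns swapped, shape (3, 2)

sorted_values = np.sort(values)       # [1 1 2 3 3]
sort_order = np.argsort(values)       # indices that would sort `values`

print(ones.shape, evens.size, transposed.shape)  # (2, 3) 5 (3, 2)
print(values[sort_order])                        # [1 1 2 3 3]
```

Indexing the original array with the result of np.argsort gives the same values as np.sort, which is handy when a second array has to be reordered in the same way.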

Observations/Notes

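Broadcasting: used to do arithmetic operations with arrays of different shapes. It’s fast because it uses C loops (instead of Python).
Broadcasting only works when the shapes are compatible: compared from the last dimension backwards, each pair of dimensions must either be equal, or one of them must be 1.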
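
A small sketch of broadcasting in action, with sample arrays of my own choosing. The last part shows what happens when the shapes are not compatible.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])       # shape (2, 3)
row = np.array([10, 20, 30])         # shape (3,)
column = np.array([[100], [200]])    # shape (2, 1)

# The row is stretched across both rows of the matrix
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is stretched across all three columns of the matrix
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]

# Shapes (2, 3) and (2,) do not line up on the last dimension (3 vs. 2)
try:
    matrix + np.array([1, 2])
except ValueError as error:
    print("cannot broadcast:", error)
```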

A common task is to reshape data.

Matrix multiplication uses NumPy’s dot product: np.dot(<ndarray>, <ndarray>) or <ndarray>.dot(<ndarray>).
The number of columns of the first matrix must match the number of rows of the second matrix.
(An incredibly useful demo is available at http://matrixmultiplication.xyz/.)
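
Here is a minimal sketch of the dot product with two small matrices (the values are made up). Multiplying shape (2, 3) by shape (3, 2) works because the inner dimensions are both 3, and the result has shape (2, 2).

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (2, 3)
b = np.array([[7, 8],
              [9, 10],
              [11, 12]])       # shape (3, 2)

product = np.dot(a, b)         # shape (2, 2)
print(product)
# [[ 58  64]
#  [139 154]]

# np.dot(a, a) would fail: (2, 3) times (2, 3) has mismatched inner dimensions.
# Transposing the second matrix makes the shapes line up: (2, 3) times (3, 2).
squared = np.dot(a, a.T)
print(squared.shape)  # (2, 2)
```

For 2-dimensional arrays, the @ operator gives the same result as np.dot, so a @ b is an equivalent spelling.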

Nested arrays pose a challenge for beginner programmers.
I get the impression that my experience with Lisp (Racket, Clojure) now comes in handy, as I’m familiar with nested data structures.
But working with multi-dimensional arrays is not intuitive and requires some thought.

Section Review

The instructor, Daniel, takes care of gently walking you through the essential points.
He provides beginner-friendly resources that help deepen your understanding.
Even if your math skills are weak, you should be able to grasp the ideas. It would be best if you were willing to put in the work and look up unfamiliar concepts.

I would have liked more practical examples of how to use NumPy. We get a sense of that when we convert an image to a NumPy ndarray. But I would have liked more exercises in that vein.

8. Matplotlib

The section covers the basics of a Python visualization library called Matplotlib. The library is built on top of NumPy, so the concepts are familiar by now.

Notes

  • there are two different interfaces: pyplot API (less flexible) & object-oriented API (recommended)

  • example workflow:

## 0. import matplotlib and get it ready for plotting in Jupyter
%matplotlib inline
import matplotlib.pyplot as plt

## 1. Prepare data
x = [1, 2, 3, 4]
y = [11, 22, 33, 44]

## 2. Setup plot
fig, ax = plt.subplots(figsize=(10, 10)) ## (width, height)

## 3. Plot data
ax.plot(x, y)

## 4. Customize plot
ax.set(title="Simple Plot",
      xlabel="this is the x-axis",
      ylabel="this is the y-axis")

## 5. Save and show (save the whole figure)
fig.savefig("../data/images/sample-plot.png")
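
To compare the two interfaces side by side, here is a rough sketch of my own (not from the course). It reuses the sample data from the workflow. The Agg backend is an assumption so that the snippet also runs outside of Jupyter, and the axes names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [11, 22, 33, 44]

# pyplot API: functions act on an implicit "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("pyplot API")

# object-oriented API: keep explicit handles to the figure and each axes
fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
ax_left.plot(x, y)
ax_left.set(title="line", xlabel="x", ylabel="y")
ax_right.bar(x, y)
ax_right.set(title="bars", xlabel="x", ylabel="y")
```

With explicit handles it is always clear which plot a call changes, which is probably why the object-oriented API is the recommended one once a figure has more than one plot.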

Observations/Section Review

It takes a while until the power of matplotlib shines through. You have to type a lot of boring code to learn how to plot basic diagrams. It felt a bit tedious, but you have to start somewhere.

The later lectures in the matplotlib section prove to be much more intriguing.

I must admit that my motivation was flagging in this part of the course. You have to learn a lot about the basics of different libraries until you can start applying your knowledge to new datasets.
I can’t wait to get more hands-on practice with a complete project.

The provided exercise is more or less a repetition of the material.

Perhaps the course would have profited from an involved practice assignment at this point: take a raw data set, clean it and extract data with NumPy and Pandas, then create visualizations with matplotlib.


Go to the other parts of the series: