
I’m going through the Udemy course Complete Machine Learning and Data Science: Zero to Mastery and writing down my observations/lecture notes.

This is the seventh part of the blog post series.

13. Data Engineering

These lectures cover what kind of data we have (structured data, unstructured data, etc.).

How can we make the raw data consumable for machine learning libraries?

What are data pipelines, and what is the role of a data engineer?

Data collection/data engineering is the step that comes before data modeling: the raw data has to be collected and prepared first.

  • Data Ingestion: collecting data
  • Data Lake: a collection of raw data
  • Pipelines: filtering/cleaning/converting
  • Data Warehouse: storage for sanitized data

Machine learning engineers mainly work with data lakes.
Business analysts mostly work with data warehouses.

ETL Pipeline: extract, transform, load
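To make the idea concrete, here is a minimal sketch of an ETL step in Python with pandas; the file, table, and column names are made up for the example:

```python
import sqlite3

import pandas as pd

# Extract: read raw data from the data lake (hypothetical CSV file)
raw = pd.read_csv("raw_events.csv")

# Transform: filter, clean, and convert
clean = raw.dropna(subset=["user_id"])                   # drop incomplete rows
clean["timestamp"] = pd.to_datetime(clean["timestamp"])  # normalize types
clean = clean[clean["event"] != "debug"]                 # filter out noise

# Load: write the sanitized result to the data warehouse (here: a SQLite table)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("events", conn, if_exists="replace", index=False)
```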

Types of Databases

  • relational databases (e.g., PostgreSQL, MySQL), good for ACID transactions (see the sketch after this list)
  • NoSQL (e.g., MongoDB), scalable distributed databases
  • NewSQL, tries to combine the former two
  • specialized databases, e.g., ElasticSearch
  • OLTP vs. OLAP: Online Transaction Processing vs. Online Analytical Processing
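To get a feel for the difference between the first two types, here is a small sketch. SQLite (from Python’s standard library) stands in for a relational database; the MongoDB part is commented out because it assumes a running server and the pymongo package:

```python
import sqlite3

# Relational: fixed schema, ACID guarantees
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")

# A transfer is atomic: either both updates happen or neither does
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 75.0), (2, 75.0)]

# Document store (NoSQL): schemaless documents, built to scale horizontally
# from pymongo import MongoClient
# db = MongoClient()["shop"]
# db.accounts.insert_one({"id": 1, "balance": 100.0, "tags": ["premium"]})
```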

The instructor briefly explains technologies such as Hadoop, Apache Spark, Flink, and Kafka.

Batch processing (example: Hadoop with Apache Spark) vs. stream processing (example: Kafka with Spark Streaming).
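As a rough illustration, here is what both modes could look like in PySpark; the file name, column names, topic, and server address are placeholders, and the streaming part additionally needs the Spark-Kafka connector package:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Batch: process a bounded dataset in one go
events = spark.read.json("events.json")
events.groupBy("date").agg(F.count("*").alias("n_events")).show()

# Streaming: the same kind of query over an unbounded source (a Kafka topic);
# results are updated continuously as new records arrive
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())
```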

14. Neural Networks: Deep Learning, Transfer Learning and TensorFlow 2

TensorFlow is an open-source library for deep learning (unstructured data). With TensorFlow, we can work with audio files, text (natural language), and images.

TensorFlow allows you to run computations on a GPU, and it offers many pre-built deep learning models.
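For example, a few lines are enough to check for a GPU and to load one of those pre-built models (this downloads the ImageNet weights on first use):

```python
import tensorflow as tf

# Shows the available GPU devices (empty list = running on CPU)
print(tf.config.list_physical_devices("GPU"))

# Load a pre-built image classification model with pre-trained weights
model = tf.keras.applications.MobileNetV2(weights="imagenet")
model.summary()
```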

The course uses Google Colab to run TensorFlow.

Setting up the Colab environment and uploading the required files proved to be frustrating. That’s not a fault of the instructor, just a bad experience with the tool.

After that, the class strikes me as a smooth entry-level tutorial on how TensorFlow works. The instructor shows how to build an image recognition model that identifies dog breeds when given an image of a dog.

I love the practical approach, and you can follow the project step by step.

TensorFlow and the Keras API could fill a course of their own. Thus, Daniel covers the essential functions that you need for the project. Although we only scratch the surface, the course offers enough material to get your feet wet.

If you’re a total beginner programmer, it will be challenging to understand how to interact with TensorFlow and Keras. Working with these libraries requires you to write many small functions that transform data.
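A typical example is turning image files into tensors of a fixed size before a model can consume them. This is a sketch in the spirit of the course, not the instructor’s exact code; the image size and file paths are assumptions:

```python
import tensorflow as tf

IMG_SIZE = 224  # input size expected by many pre-trained image models

def process_image(path):
    """Turn an image file path into a normalized float tensor."""
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)  # scale to [0, 1]
    return tf.image.resize(image, [IMG_SIZE, IMG_SIZE])

# Build a tf.data pipeline from file paths (placeholders here)
paths = ["dog_0001.jpg", "dog_0002.jpg"]
dataset = tf.data.Dataset.from_tensor_slices(paths).map(process_image).batch(32)
```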

The instructor glosses over the intricacies of the libraries. For me, that’s a good thing, as you can keep your focus on the project at hand. On the other hand, you won’t understand what you’re doing unless you study the documentation on your own.

On the plus side, you have a working deep learning model at the end of the chapter. After finishing the section, you have a basic understanding of how to use the libraries.

Notes

Deep learning combines many layers of neural networks into a single model (a neural network is itself a machine-learning model).

There are three main types of deep learning problems:

  • classification (image processing, email spam), see the sketch after this list
  • sequence to sequence (audio processing, e.g., Google Translate)
  • object detection (find an object inside an image)
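For the first type, a classifier in Keras can be as small as a few stacked layers. This is a generic sketch, not code from the course; the layer sizes, input shape, and number of classes are arbitrary:

```python
import tensorflow as tf

# A small stack of layers for a classification problem
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 output classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)  # needs training data, omitted here
```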

Transfer learning means taking a model trained on one domain and applying its learned patterns to a different domain.
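In Keras, that usually means freezing a pre-trained base network and training only a new output layer on top. A minimal sketch, assuming MobileNetV2 as the base and 120 dog breeds as the target classes (the numbers are illustrative):

```python
import tensorflow as tf

# Pre-trained base network without its original classification head
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the learned patterns

# New head for the target domain (e.g., 120 dog breeds)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(120, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```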

Recap

The class offers an adequate introduction to working with Google Colab, TensorFlow, and the Keras API.

The course doesn’t delve deep into deep learning. Math and theory are sparse in favor of gaining practical knowledge of the libraries.

Daniel manages to break down the steps needed to finish a project. His methods are a valuable blueprint for tackling your own projects.

