Machine Learning & Data Analytics

Course Index

Introduction

Credit Card Transaction Photo by Blake Wisz on Unsplash

This material was originally part of a university course I created on data science and machine learning. I've reorganized the pages a bit and indexed them here to make the information easier for the general public to use.

If you're a student in CSE 450, this probably isn't the page you're looking for. I would suggest the official course index found here.

Some of the readings make reference to the course textbook Fundamentals of Machine Learning for Predictive Data Analytics (1st edition), which I highly recommend.

Though you can almost certainly find alternative sources for the topics covered in the text by searching online, the book does a great job of providing a high-level overview of each topic, as well as covering the math and statistics needed to understand how a particular technique should be used (and when it shouldn't be).

The course is designed around a series of two-week modules, though the exact timing doesn't matter so much outside of a traditional classroom setting.

Resources and Tips

Here are some tips and reference pages we provided to students to help them succeed in the course:

Module One: A Crash Course in Data Analytics and Visualization

The first module has four topics. Each assignment is designed to introduce one or more topics of data analytics and visualization, and to give students practice taking a business problem and re-framing it as an analytics or visualization problem.

Each topic has a Preparation reading that covers the theory, math, and statistical background of a given topic, as well as some references and tips for the coding aspects of the topic. This is followed by a Data Exploration assignment presented in the form of a simulated case-study.

Classroom Implementation Tips

The first module is designed as a refresher or fast introduction to data wrangling and visualization using Python. Most of the students taking this course had some exposure to one or more of these topics in previous courses, but that wasn't always the case.

At the end of the first module, a self-evaluation assignment allows students to determine how likely they are to be able to move forward with the course successfully.

The timing of this assessment fell before the drop/add deadline, which gave me a chance to sit down with students and recommend alternative preparatory courses as needed.

1. Netflix Movies

Netflix Photo by Mollie Sivaram on Unsplash

Topics Covered

2. Cereal Marketing

Cereal Photo by Haley Owens on Unsplash

Topics Covered

3. Titanic Survival Patterns

Iceberg Photo by Derek Oyen on Unsplash

Topics Covered

4. Flower Classification

Iris Flower

Topics Covered

Self-Assessment

This assignment is designed to assess how well students understood the topics from the previous assignments in this module by testing their ability to apply the concepts to similar, but different, situations.

Modules Two - Six: An Applied Survey of Machine Learning Topics

In each of the remaining modules, students learn about the theory and application of a machine learning algorithm through a preparation reading assignment.

Each reading also introduces auxiliary topics, such as feature engineering, model evaluation techniques, and tuning of hyperparameters.

For each topic, a business problem is then introduced through a Case Study Intro, where students learn about the business problem, and the primary concerns of the key stakeholders.

The students then engage in an In-class Case Study Discussion where stakeholders ask questions of the "data science team". Students discuss each question as a team and then share their answers and the reasoning behind them. Some questions have a single correct answer, while others have multiple answers that each have reasonable merit.

Students then start the Case Study Project, which outlines key deliverables the stakeholders are interested in.

Classroom Implementation Tips

If you're adopting this course for use in a classroom, here are some implementation notes that will make this more successful:

Module 02: Banking and Market Segmentation

Portugal Photo by Ricardo Resende on Unsplash

This module covers decision trees and additional model evaluation methods (ROC curves and AUC) within the context of a Portuguese bank that wants to engage in some targeted marketing efforts.
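As a rough illustration of the AUC metric mentioned above, here is a minimal pure-Python sketch. It uses the pairwise interpretation of AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name roc_auc and the labels and scores are made up for illustration; they are not taken from the course materials or the bank dataset.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = customer subscribed) and model scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.3f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. In practice a library routine would normally be used instead of this quadratic-time loop.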

Module 03: House Price Estimation

Meeting Photo by Campaign Creators on Unsplash

This module covers model ensembles, gradient boosted trees, and using the XGBoost library to solve regression tasks within the context of a real estate firm that wants to use machine learning to improve the accuracy of their housing valuations.
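To give a feel for the idea behind gradient boosted trees, here is a small pure-Python sketch that boosts one-split "stumps" on a single feature using squared error. It is a simplified stand-in for illustration only, not XGBoost and not the course's code. The function names and the tiny housing dataset are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical data: floor area (hundreds of sq ft) and price (thousands).
xs = [10, 15, 20, 25, 30]
ys = [150, 200, 260, 310, 400]

model = fit_boosted(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict_boosted(model, x) for x in xs])
print(f"baseline MSE: {baseline_mse:.1f}, boosted MSE: {boosted_mse:.1f}")
```

Each round fits a weak learner to the errors left by the previous rounds, and the learning rate shrinks each correction. Training error alone says nothing about generalization, so a held-out set would still be needed to evaluate a real model.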

Module 04: Bike Rentals and Resource Planning

Bikes Photo by Markus Winkler on Unsplash

This module introduces basic neural networks within the context of a bike rental startup that wants to find a better way to handle staff and inventory planning.

Module 05: Self Driving Cars and Computer Vision

Car Photo by Campbell Boulanger on Unsplash

This module introduces convolutional neural networks and data augmentation within the context of an automotive company doing research into self-driving cars.
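As a small illustration of data augmentation, the sketch below adds a horizontally flipped copy of each image to a dataset. Images are represented as nested lists of pixel values purely for simplicity, and the function names are hypothetical. Whether flipping is a sensible transform depends on the task and its labels, so treat this as one example of the idea rather than a recommendation.

```python
def flip_horizontal(image):
    """Return a copy of the image with each row of pixels reversed."""
    return [list(reversed(row)) for row in image]


def augment(images):
    """Return the original images plus a mirrored copy of each one."""
    augmented = []
    for image in images:
        augmented.append(image)
        augmented.append(flip_horizontal(image))
    return augmented


# A made-up 2x3 grayscale "image".
image = [
    [0, 1, 2],
    [3, 4, 5],
]
flipped = flip_horizontal(image)
dataset = augment([image])
print(flipped)
```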

Module 06: AI Content Generation

Printing Machine Photo by Bank Phrom on Unsplash

This module introduces recurrent neural networks within the context of a publishing company seeking to expand its catalogue using AI-generated content.

"Bonus" modules

These modules were cut from the course, either to make room for more relevant algorithms or because of issues with the data.

KNN Module: Predicting Hits from Spotify Data

Song Photo by Austin Neill on Unsplash

Before we had kNN as part of the flower classification assignment in week 01, we had an entire module on the topic, framed within the context of a record label trying to identify trends in top songs based on metrics in Spotify data.

The module covered kNN in a bit more detail than the module one assignment does, but after a couple of iterations of the course, we found this problem to be a bit ill-defined, and students were having trouble producing actionable insights from the data, so I ultimately cut the module and distilled the key points into module one.
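For readers unfamiliar with kNN, here is a minimal pure-Python sketch of the classifier: find the k training points closest to a query and take a majority vote over their labels. The feature names (danceability and energy) and all values are invented for illustration and are not drawn from the module's Spotify data.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical songs described by (danceability, energy).
train_points = [
    (0.8, 0.9), (0.7, 0.8), (0.9, 0.7),   # labelled "hit"
    (0.2, 0.3), (0.3, 0.2), (0.1, 0.4),   # labelled "miss"
]
train_labels = ["hit", "hit", "hit", "miss", "miss", "miss"]

prediction = knn_predict(train_points, train_labels, (0.75, 0.85))
print(prediction)
```

Because kNN relies on distances, features measured on very different scales usually need to be normalized first; the made-up features here already share a 0 to 1 range.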

Clustering Module: DE&I in the Workplace

Meeting Photo by Campaign Creators on Unsplash

I loved the premise of this topic - evaluating hiring practices and salary trends in order to quantitatively answer questions about workplace diversity and pay gap issues. It was also the only module in the course that covered unsupervised learning (via k-Means Clustering).

Unfortunately, the dataset had a lot of underlying issues due to how it was generated, which led to students coming to some pretty obvious conclusions too quickly.

At the same time, we really wanted to make room for RNNs, which we felt were more important for our students to understand than unsupervised learning. So this module was ultimately cut from the course.
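For reference, the k-Means clustering mentioned above can be sketched in a few lines of pure Python: assign each point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat until nothing changes. The function name, the starting centroids, and the employee data below are all hypothetical and are not taken from the module's dataset.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, max_iterations=10):
    """Alternate assignment and centroid updates until the centroids settle."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)

        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # Mean of each coordinate across the cluster's members.
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid

        if new_centroids == centroids:
            break
        centroids = new_centroids

    assignments = [nearest_centroid(p, centroids) for p in points]
    return centroids, assignments


# Made-up employees: (years of experience, salary in thousands).
points = [(1, 40), (2, 45), (3, 42), (10, 90), (12, 95), (11, 100)]
centroids, assignments = kmeans(points, [points[0], points[3]])
print(centroids, assignments)
```

The result depends on the starting centroids and on feature scaling, so real analyses typically standardize the features and try several initializations.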