Machine Learning & Data Analytics

Module 03 — Housing Estimates, Class Discussion Questions

Meeting Photo by Campaign Creators on Unsplash

Questions

You're about to go into a strategy meeting with the CEO, Vice President of Human Resources, and Vice President of Finance. They want to make sure you have the data required to answer the questions they're most interested in.

Be prepared to answer the following questions:

Problem Type

Devon, the CEO, says:

I just sat through four hours of machine learning training with the board of directors this past week, so I'm curious to get your take on this.

Looking at the data and our business model, what kind of machine learning problem do you think we're looking at here?

Based on your initial analysis of the data, your team feels:

  1. This is a supervised regression problem
  2. This is a supervised classification problem
  3. This is an unsupervised learning problem
  4. This is a semi-supervised learning problem

Model Confidence

Cecil, the VP of Customer Relations, asks:

My biggest concern right now is making sure that whatever method we come up with to predict housing prices, we can also attach some kind of empirical confidence metric.

Based on your initial analysis of the data, your team feels you can best show confidence in your model by using:

  1. The sum of squares error (SSE).
  2. The mean squared error (MSE).
  3. The root mean squared error (RMSE).
  4. The $R^2$ value.
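For reference while discussing the options above, here is a minimal sketch of how the four metrics relate to one another. The function name `regression_metrics` and the price values (in thousands of dollars) are hypothetical and chosen only for illustration; they are not taken from the module's housing data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute SSE, MSE, RMSE, and R^2 for paired actual/predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n

    # Sum of squared errors: total squared distance between actual and predicted.
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared distance between actual values and their mean.
    sst = sum((t - mean_y) ** 2 for t in y_true)

    mse = sse / n            # average squared error (squared target units)
    rmse = math.sqrt(mse)    # back in the same units as the target
    r2 = 1 - sse / sst       # unitless share of variance explained
    return {"sse": sse, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical prices in thousands of dollars (illustrative only).
y_true = [200.0, 250.0, 300.0, 350.0]
y_pred = [210.0, 240.0, 310.0, 340.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Note that SSE grows with the number of rows, MSE is in squared units, RMSE is in the units of the target, and $R^2$ is unitless, which may matter when deciding which number to present to a non-technical audience.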

Insurance Question

William, the VP of Finance, asks:

Our insurance division is particularly interested in making sure our investment portfolio avoids certain — uh — less savory property types.

Is there a way we can easily identify properties in low-income areas and have the model lower those estimates to protect our investors' interests?

Based on your initial analysis of the data, your team feels:

  1. We have the necessary data in the correct form to answer this question.
  2. The data we have cannot answer that question; we need to collect more data.
  3. We could use the data we have, but we'll have to normalize some of the features, and/or encode some of them differently.
  4. Answering this question would be a violation of ethics and/or privacy laws.

Data Analysis

Johnny, the data science intern, asks:

The head of data science says we should use gradient boosted trees for this analysis.

I've noticed that a lot of the features use pretty different ranges.

For example, how should we handle square footage?

Based on your initial analysis of the data, your team feels:

  1. We should normalize square footage values using range normalization (aka min-max scaling).
  2. We should standardize square footage values using z-score normalization.
  3. We should use binning to group square footage values into discrete categories.
  4. We should be fine sticking with the raw values.
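To make the first three options concrete, the sketch below applies each transformation to a small list of square footage values. The values and the helper names (`min_max_scale`, `z_score`, `equal_width_bins`) are hypothetical, and equal-width binning is only one of several possible binning schemes. The sketch shows what each option does to the numbers; it does not settle which one suits a tree-based model.

```python
import statistics

# Hypothetical square footage values (illustrative only).
sqft = [800.0, 1200.0, 1600.0, 2000.0, 2400.0]


def min_max_scale(values):
    """Range normalization: map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardization: subtract the mean, divide by the population std dev."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Binning: assign each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


scaled = min_max_scale(sqft)
standardized = z_score(sqft)
binned = equal_width_bins(sqft, n_bins=4)

print(scaled)
print(standardized)
print(binned)
```

Both min-max scaling and z-score standardization preserve the ordering of the values, while binning discards the differences between values that fall in the same interval.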

  1. CEO photo by Oz Seyrek on Unsplash  

  2. VP of Customer Support photo by Christina @ wocintechchat.com 

  3. VP of Finance photo by steffen Wienberg on Unsplash 

  4. Data Science Intern photo by Fábio Lucas on Unsplash