Module 04 — Hit Songs, Case Study Discussion

Questions

You're at a strategy meeting with the stakeholders. They want to make sure you have the data required to answer the questions they're most interested in.

Be prepared to answer the following questions:

Data Science Methods

Johnny, Data Science Intern

I guess the first thing we need to decide is if this is a classification problem or a regression problem.

Based on your initial analysis of the data, your team feels:

This is a classification problem.
This is a regression problem.

Additional Insights

Tom Jones, Head of marketing and brand development

I don't want to know just whether or not a particular song is going to be a hit or not, I want to know what goes into making it a hit. What are the most important factors and which factors don't matter at all.
Is there an algorithm that can tell me that?

Based on your initial analysis of the data, your team feels:

Yes, k Nearest Neighbors can answer that question.
Yes, decision trees would do a good job here.
Yes, clustering is the way to answer that question.
No, all we can give you is a predictive model, not tell you why how it works.

Timing

Ezra, Lead Singer of the Wasps

But if we're like, using a bunch of old data, won't your model just predict what would have been a popular song twenty years ago?
How can we make sure that what the model predicts for today would actually be popular today?

Based on your initial analysis of the data, your team feels that the best way to handle this is:

To stratify the training and test data by decade to figure this out using some kind of model ensemble.
To look for features that persist across the decades and appear to be timeless.
To filter the dataset to include only the songs that were popular in the last ten years or so.
There's no way to guarantee this.

Model Evaluation

Johnny, Data Science Intern

How will we know if our model has strong predictive power?
Which evaluation metric will we use?

Based on your initial analysis of the data, your team feels:

We'll use the accuracy of test set predictions.
We'll use the $F_1$ score.
We'll use the root mean squared error (RMSE).
We'll use the $R^2$ value.