Overview
After a few more meetings, your team has been asked to address the following questions from the stakeholders:
Tom Jones, Head of marketing and brand development
After our discussion yesterday, I went back to corporate, and they want it all.
They want to know if a song is going to be popular, they want to know why it's popular, and they want to know if we can quantify how tastes have changed over the decades.
Ezra, Lead Singer of the Wasps
I want to know one more thing, sort of the opposite of Tom's questions. Obviously, everyone wants to top the charts, but I know a lot of guys that are living large just based on royalties from midlist songs: songs that aren't super popular, but that also aren't horrible.
So if popularity has a big timing component to it, can you also build a model or something that says "whatever you do, don't make a song like this, regardless of when it comes out"?
Tom Jones, Head of marketing and brand development
In other words, I want a model that predicts hits.
Ezra wants to know if there's a way to predict flops.
Wouldn't that be the same model, but in reverse?
Ezra, Lead Singer of the Wasps
Maybe?
Tom Jones, Head of marketing and brand development
Either way, please have a summary of your research ready by Saturday.
Johnny, the Data Science Intern
Hey, I put together some tips for us to use, just like old times!
Executive Summary
By now you know what level of quality we expect from you in terms of analysis summaries. From this point on, you will not be provided with an executive summary template.
While you now have some leeway in the styling of this report, please ensure that what you turn in has the same level of quality and professionalism as previous templates.
Tips from Johnny
Data Dictionary
Our database analyst put together this data dictionary to help explain the values and sources of different columns in the datasets.
Different Views of the Data
There are several different views of the historic song data. One gives just the raw data for each track. Another is grouped by year, with all the hit songs for that year aggregated using the mean or mode of each attribute. Others are grouped by artist, by genre, and so on.
Some of those may be more useful than others in your analysis. You might also want to generate your own summaries of the data using something other than mean and mode.
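As one possible starting point for building your own summary, the sketch below groups track-level data by year and reports the median and interquartile range instead of the mean or mode. The DataFrame contents and the column names `year` and `tempo` are made up for illustration and will not match the real datasets, so check the data dictionary for the actual column names.

```python
import pandas as pd

# Hypothetical track-level data; values and column names are illustrative only.
tracks = pd.DataFrame({
    "year": [1980, 1980, 1980, 1990, 1990, 1990],
    "tempo": [100.0, 120.0, 200.0, 90.0, 95.0, 130.0],
})


def iqr(values):
    """Interquartile range: spread of the middle 50% of the values."""
    return values.quantile(0.75) - values.quantile(0.25)


# Median and IQR are less sensitive to outliers than the mean.
summary = tracks.groupby("year")["tempo"].agg(median="median", iqr=iqr)
print(summary)
```

The same pattern should work for grouping by artist or genre: change the column passed to `groupby` and swap in whichever statistics you want to compare.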
k-Nearest Neighbors
You can find documentation on how to use kNN with scikit-learn on these pages:
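For orientation, here is a minimal sketch of a kNN classifier in scikit-learn. The feature values, the two feature names, and the hit / not-a-hit labels are invented for illustration and are not taken from the project data. Features are standardized first because kNN relies on distances, so a feature on a large scale such as tempo would otherwise dominate.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features: [danceability, tempo]; labels: 1 = hit, 0 = not a hit.
X = [[0.90, 120], [0.80, 118], [0.85, 125],
     [0.20, 70], [0.30, 75], [0.25, 72]]
y = [1, 1, 1, 0, 0, 0]

# Scale the features, then classify by majority vote of the 3 nearest songs.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

# Predict labels for two unseen (also hypothetical) songs.
predictions = model.predict([[0.82, 121], [0.22, 71]])
print(predictions)
```

Ezra's "flop" question could be approached the same way by changing what the label represents, though whether that amounts to the hit model in reverse is something your analysis would need to test.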
Hyperparameters
When you build a model, you specify some settings for how the model should work: which algorithm to use, the values of different constraints, and so on. In machine learning, these settings are collectively called "hyperparameters".
These settings will have a large effect on how well your model performs. You could test different combinations manually, or you could use an automated method such as Grid Search.
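A grid search can be sketched with scikit-learn's `GridSearchCV`, which fits the model once per combination of settings and cross-validation fold, then keeps the combination with the best score. The synthetic dataset and the candidate values in `param_grid` below are assumptions chosen only to keep the example small; they are not recommendations for the song data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with features and labels from the song datasets.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate hyperparameter values (illustrative): 3 x 2 = 6 combinations.
param_grid = {
    "n_neighbors": [3, 5, 11],
    "weights": ["uniform", "distance"],
}

# Evaluate every combination with 5-fold cross-validation on accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the number of fits grows with every value added to the grid, it may help to start with a coarse grid and narrow it around the best result.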
Johnny, the Data Science Intern, catches you after work:
Hey, I know you're probably busy, but in case you need it, I put together this collection of Google Colab notebooks that might be useful for this project.