Machine Learning & Data Analytics

Module 02 — Targeted Marketing, Project

Overview

After a few more meetings, Beatriz has assigned your team to address the following issues raised by the stakeholders:

Miguel Ferreira, Bank President asks:

Like I said the other day, the core task we're interested in is identifying those customers most likely to subscribe to a term deposit.

This way, we can build a targeted marketing campaign that focuses primarily on those customers.

Francisco, VP of Marketing says:

And I'd like you to find any actionable patterns in our results. Should we only call single people on Saturdays? Does it make sense to call students at all?

Things like that.

Miguel Ferreira, Bank President adds:

One other thing we should probably address: does contacting people too frequently for these marketing campaigns have an adverse effect on the outcome?

Beatriz, Senior Data Scientist says:

One last thing: there are a bunch of social and economic indicators in the data.

We should be careful about how we consider these. We may want to see separate models for times when, for example, the consumer confidence index is high compared to when it is low.

Miguel Ferreira, Bank President adds:

Good thought Beatriz. Different customer segments tend to react to economic changes differently.

We'll definitely want to know if it's better to use a particular model during different economic situations.

Beatriz, Senior Data Scientist says:

One last thing: since we're planning on deploying these models, once you have them trained and tested, be sure to persist those trained models to a file so we can load them into our production systems.

Miguel Ferreira, Bank President adds:

Sounds like we have enough to get started. If you could send us your write up on this by Saturday night, that would be great.

Team Project Expectations

Be sure to read over the Team Project Expectations guide to know what the expectations are for this and future projects.

Tips from Johnny

Johnny, the data science intern, whispers to you after the meeting:

Hey, I put together a list of tips and ideas that might help us out:

Data Dictionary

Our database analyst put together this data dictionary to help explain the values and sources of different columns in the bank dataset, so be sure to review that.

Target Variable

One oddity here is that our target feature is simply labeled y, and it holds a boolean-style value of "y" or "n" indicating whether the client subscribed to a term deposit.
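Assuming the column really does hold "y"/"n" strings (the sample values below are made up for illustration), one way to convert it into a numeric target that scikit-learn models can use is a simple `map()`:

```python
import pandas as pd

# Stand-in frame; in the project this would be the bank dataset.
df = pd.DataFrame({"y": ["y", "n", "y", "n", "n"]})

# Map the string labels to integers so the column works as a binary target.
df["y"] = df["y"].map({"y": 1, "n": 0})

print(df["y"].tolist())  # [1, 0, 1, 0, 0]
```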

Feature Scaling

If you're going to be comparing different numeric features, be sure they are on the same scale. You may find it useful to apply min-max scaling to handle this problem. You could do the calculation manually, or use scikit-learn's MinMaxScaler.
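As a quick sketch, min-max scaling maps each column onto [0, 1] via (x − min) / (max − min). The values below are hypothetical, not from the bank dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two numeric columns on very different scales (made-up age and balance values).
X = np.array([[19.0, 1000.0],
              [35.0,  250.0],
              [61.0, 4000.0]])

scaler = MinMaxScaler()               # rescales each column independently to [0, 1]
X_scaled = scaler.fit_transform(X)

print(X_scaled[:, 0])                 # first column now spans 0.0 to 1.0
```

Fit the scaler on the training split only, then reuse it to transform the test split, so no information leaks from test to train.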

Binning

Just as you did with the Titanic dataset when you reduced the number of titles, you may find it useful to "bin" categorical features into discrete groups in order to address some of the questions above. There are multiple ways to do this, but previously we used the map() function.
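A minimal sketch of that `map()` approach, using a hypothetical grouping of job categories (the group names here are assumptions, not part of the dataset):

```python
import pandas as pd

# A few job values like those in the bank dataset.
jobs = pd.Series(["student", "retired", "admin.", "technician", "student"])

# Bin the detailed categories into broader, hypothetical groups.
job_bins = {
    "student": "not_working",
    "retired": "not_working",
    "admin.": "white_collar",
    "technician": "white_collar",
}

binned = jobs.map(job_bins)
print(binned.tolist())
```

Note that `map()` returns `NaN` for any category missing from the dictionary, so make sure every value in the column has an entry.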

Decision Trees

You can find documentation on how to use decision trees with scikit-learn on these pages:
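In the meantime, here's a minimal sketch of fitting a decision tree classifier, using synthetic data as a stand-in for the bank features and target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the project this would be the bank features and y.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

accuracy = tree.score(X_test, y_test)   # fraction of correct test predictions
print(round(accuracy, 2))
```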

Model Persistence

When you train a model, a large amount of information is stored in memory. That model can then be used to make predictions for new instances at a later time.

You'll want to save these trained models using Python's pickle module, as shown here.

However, rather than using pickle's default protocol version, you should use protocol version 5, which was introduced in Python 3.8 and is optimized for dealing with structures that contain numpy arrays and pandas data frames.
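A sketch of the round trip, trained on synthetic stand-in data (the filename `model.pkl` is just an example):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Train a small model on synthetic stand-in data.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Persist with protocol 5 (Python 3.8+), which handles the large numpy-backed
# structures inside fitted scikit-learn models efficiently.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=5)

# Later (e.g. in production), load the model and reuse it for predictions.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)

print((loaded.predict(X) == model.predict(X)).all())  # True
```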

Model Ensembles, Bagging, and Boosting

Often, we can get better results by using a set of models, each using a slightly different set of training data, or other parameters. These are called "Model Ensembles" and it's very common to use an ensemble of decision trees (often called a "Random Forest") rather than a single tree.

Two popular techniques used in the creation of ensembles are "boosting" and "bagging". You can read more about these topics on pages 163 - 167 of your textbook.

For details on how to use these techniques with scikit-learn, see this page.
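As a rough sketch, a random forest (bagging of decision trees) and a gradient-boosting ensemble can both be cross-validated the same way; the synthetic data here stands in for the bank dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for the bank features and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Bagging of trees (a random forest) and a boosting ensemble.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
boosted = GradientBoostingClassifier(random_state=0)

results = {}
for name, model in [("forest", forest), ("boosted", boosted)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    results[name] = scores.mean()
    print(name, round(results[name], 2))
```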

Avoiding overfitting through pruning

It is very easy to overfit a decision tree. The text discusses strategies to avoid this problem in section 4.4.4 (pages 158 - 163).

In scikit-learn, you can use parameters such as max_depth and min_samples_leaf to control tree complexity and overfitting.

Alternatively, you can use something more elaborate, such as cost complexity pruning.
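A quick sketch of cost-complexity pruning via the `ccp_alpha` parameter, comparing an unpruned tree against a pruned one on synthetic stand-in data (the alpha value here is arbitrary; in practice you'd tune it, e.g. with `cost_complexity_pruning_path`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for the bank dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned tree vs. a tree pruned with cost-complexity parameter ccp_alpha.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

# The pruned tree should have no more leaves than the full tree.
print(full.get_n_leaves(), pruned.get_n_leaves())
```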

Johnny, the Data Science Intern, drops by your hotel room around midnight:

Okay, just one last thing, if you need any more help at all, I put together this collection of Google Colab notebooks that might be useful.


  1. CEO photo by Oz Seyrek on Unsplash  

  2. VP of HR photo by Christina @ wocintechchat.com 

  3. VP of Finance photo by Steffen Wienberg on Unsplash 

  4. Data Science Intern photo by Fábio Lucas on Unsplash 