Month: August 2020

Udacity Machine Learning Nanodegree: Starbucks Capstone Challenge


This blog post is part of the final project for the Udacity Machine Learning Engineer Nanodegree program, which uses AWS SageMaker.

The project contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. The goal of the project is to analyze this "historical" data in order to implement an algorithm that finds the most suitable offer type for each customer.

The data is contained in three files:

  • portfolio.json - offer ids and metadata about each offer (duration, type, etc.)
  • profile.json - demographic data for each customer
  • transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and an explanation of each variable in the files:

portfolio.json

  • id (string) - offer id
  • offer_type (string) - type of offer, i.e. BOGO, discount, or informational
  • difficulty (int) - minimum required spend to complete an offer
  • reward (int) - reward given for completing an offer
  • duration (int) - time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) - age of the customer
  • became_member_on (int) - date when customer created an app account
  • gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
  • id (str) - customer id
  • income (float) - customer's income

transcript.json

  • event (str) - record description (e.g. transaction, offer received, offer viewed)
  • person (str) - customer id
  • time (int) - time in hours since start of test. The data begins at time t=0
  • value (dict of strings) - either an offer id or transaction amount depending on the record
Solution Statement

This is a classification problem. My approach is to build a machine learning model that
predicts the best offer for each user, either BOGO or discount (informational offers are left out
because they have no real "conversion").
These are the steps to follow:

  • Fetching data
  • Data Cleaning
  • Data Preparation
  • Data visualization and analysis
  • Train Model
  • Evaluate the model

Evaluation Metrics

Since this is a classification problem we can use the following metrics to evaluate the model:

  • Precision - the proportion of predicted positive cases that are actually positive.
  • Recall - the proportion of actual positive cases that are correctly identified.
  • F1-score - the harmonic mean of precision and recall.
  • Roc_auc_score - the Area Under the Curve (AUC) of the Receiver Operating Characteristic
    (ROC) curve. It measures how well the model ranks positive cases above negative ones, so a
    higher value means that higher predicted probabilities are associated with true positives.
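To make these definitions concrete, here is a small sketch that computes the four metrics by hand on made-up labels and scores. The function names are hypothetical; in practice scikit-learn's precision_score, recall_score, f1_score and roc_auc_score cover the same ground. The 0.5 decision threshold is an assumption for the example.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and model scores, thresholded at an assumed 0.5.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(precision, recall, f1, auc)
```

Note that precision, recall and F1 depend on the chosen threshold, while the AUC only depends on how the scores rank the examples.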

Algorithms evaluated

We explored two algorithms, Linear Learner and XGBoost, to find the best model for each offer type: discount, Buy One Get One (BOGO),
or informational.

Project design

During the data fetching step, we join the different pieces of information coming
from the 3 data sources.
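As a rough illustration of this join, the sketch below attaches customer and offer attributes to each event using the ids described in the schema (person matches the profile id, and the offer id matches the portfolio id). The records are made up, and it assumes the value field has already been flattened into a hypothetical offer_id field.

```python
# Made-up records shaped like the three files.
portfolio = [
    {"id": "o1", "offer_type": "bogo", "difficulty": 10, "reward": 10, "duration": 7},
]
profile = [
    {"id": "p1", "age": 55, "gender": "F", "income": 72000.0},
]
events = [
    {"person": "p1", "event": "offer received", "time": 0, "offer_id": "o1"},
    {"person": "p1", "event": "transaction", "time": 6, "offer_id": None},
]

# Index the lookup tables by id so each event can be joined in constant time.
offers_by_id = {offer["id"]: offer for offer in portfolio}
customers_by_id = {customer["id"]: customer for customer in profile}


def join_event(event):
    """Return the event enriched with customer and offer attributes (left join)."""
    customer = customers_by_id.get(event["person"], {})
    offer = offers_by_id.get(event["offer_id"], {})
    row = dict(event)
    row.update({k: v for k, v in customer.items() if k != "id"})
    row.update({k: v for k, v in offer.items() if k != "id"})
    return row


joined = [join_event(event) for event in events]
for row in joined:
    print(row)
```

Events without an offer, such as plain transactions, keep their customer attributes but get no offer columns, which mirrors a left join.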

Data preparation. After analyzing the data, we transform the dataset through several stages:
missing-value imputation, categorical encoding, data standardization, etc.

Clean and visualize the data to understand its content, and analyze it to find possible outliers
and remove them (if possible).

Data distribution visualization

Feature Engineering

Preprocessing with scikit-learn

  • Encode categorical variable gender (One Hot encoder)
  • Impute missing values
  • Normalize feature distributions
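The three steps above can be sketched with scikit-learn roughly as follows. This is a minimal example on made-up rows shaped like profile.json, not the exact pipeline from the notebooks: the median imputation strategy, the use of StandardScaler for normalization, and the choice of columns are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like profile.json; the values are made up for illustration.
profile = pd.DataFrame({
    "gender": ["F", "M", "O", "M"],
    "age": [55, 40, 33, 62],
    "income": [72000.0, np.nan, 51000.0, 64000.0],
})

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer(
    [
        ("num", numeric_steps, ["age", "income"]),
        # One-hot encode gender; categories unseen at fit time become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocessor.fit_transform(profile)
print(features.shape)
```

Wrapping the steps in a single ColumnTransformer means the same fitted transformations can be reapplied to validation and test data with preprocessor.transform.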

Things to notice:

  • There are 3 genders recorded in the dataset (F, M, O), which are encoded during the feature engineering phase.
  • The scatterplot shows some odd structure, most likely because the data was simulated; there also don't appear to be any outliers.

Train the model

We used a separate ML model per offer type: one to predict BOGO, one to
predict discount, and one to predict the informational offer. For each model, we tried two algorithms: Linear Learner and the Amazon SageMaker built-in XGBoost algorithm.

For each algorithm, we tune the hyperparameters to find the configuration with the best performance.
With the best two models, we can combine the results in order to obtain a single type of offer for
the user.
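One simple way to combine the per-offer models is to pick the offer type whose model gives the highest score for the user. The sketch below assumes each model outputs a probability that the user completes that offer. The function choose_offer and the 0.5 threshold are hypothetical and only illustrate the idea.

```python
def choose_offer(scores, threshold=0.5):
    """Pick the offer type with the highest score, or None if no score reaches the threshold."""
    best_offer, best_score = max(scores.items(), key=lambda item: item[1])
    return best_offer if best_score >= threshold else None


# Made-up model outputs for two users.
print(choose_offer({"bogo": 0.72, "discount": 0.64}))
print(choose_offer({"bogo": 0.20, "discount": 0.30}))
```

Returning None when no score clears the threshold leaves room for sending no offer at all to users who are unlikely to respond.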

Evaluate the model. We measure and compare the performance of the models against the
current benchmark to learn whether the proposed solution is viable for the offer attribution process.


The 3 notebooks and the data used for this project can be found in the following GitHub repository:

  • data-exploration.ipynb
  • feature_engineering.ipynb
  • training_the_model.ipynb