Classification and Hypothesis Testing: Hotel Booking Cancellation Prediction

by Keanu Sida


Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include change of plans, scheduling conflicts, etc. This is often made easier by the option to do so free of charge or preferably at a low cost. This may be beneficial to hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

This pattern of cancellations of bookings impacts a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
  4. Human resources to make arrangements for the guests.

Objective

This increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal - they are facing problems with this high number of booking cancellations and have reached out to your firm for data-driven solutions. You, as a Data Scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below:

Data Dictionary

  • Booking_ID: Unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of weekday nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.

Importing the libraries required

In [2]:
# Importing the basic libraries we will require for the project

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libaries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder

# To get diferent metric scores
from sklearn.metrics import confusion_matrix,classification_report,roc_auc_score,plot_confusion_matrix,precision_recall_curve,roc_curve,make_scorer

# Code to ignore warnings from function usage
import warnings;
import numpy as np
warnings.filterwarnings('ignore')

Loading the dataset

In [3]:
hotel = pd.read_csv("INNHotelsGroup.csv")
In [4]:
# Copying data to another variable to avoid any changes to original data
data = hotel.copy()

Overview of the dataset

Let's view the first few rows and last few rows of the dataset in order to understand its structure a little better. We will use the head() and tail() methods from Pandas to do this.

In [5]:
data.head()
Out[5]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50 0 Canceled
In [6]:
data.tail()
Out[6]:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
36270 INN36271 3 0 2 6 Meal Plan 1 0 Room_Type 4 85 2018 8 3 Online 0 0 0 167.80 1 Not_Canceled
36271 INN36272 2 0 1 3 Meal Plan 1 0 Room_Type 1 228 2018 10 17 Online 0 0 0 90.95 2 Canceled
36272 INN36273 2 0 2 6 Meal Plan 1 0 Room_Type 1 148 2018 7 1 Online 0 0 0 98.39 2 Not_Canceled
36273 INN36274 2 0 0 3 Not Selected 0 Room_Type 1 63 2018 4 21 Online 0 0 0 94.50 0 Canceled
36274 INN36275 2 0 1 2 Meal Plan 1 0 Room_Type 1 207 2018 12 30 Offline 0 0 0 161.67 0 Not_Canceled

Understand the shape of the dataset

In [7]:
data.shape
Out[7]:
(36275, 19)
  • The dataset has 36275 rows and 19 columns.

Check the data types of the columns for the dataset

In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
  • Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status are of object type while rest columns are numeric in nature.

  • There are no null values in the dataset.

Dropping duplicate values

In [9]:
# checking for duplicate values
data.duplicated().sum()
Out[9]:
0
  • There are no duplicate values in the data.

Dropping the unique values column

Let's drop the Booking_ID column first before we proceed forward, as a column with unique values will have almost no predictive power for the Machine Learning problem at hand.

In [10]:
data = data.drop(["Booking_ID"], axis=1)
In [11]:
data.head()
Out[11]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00 0 Not_Canceled
1 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68 1 Not_Canceled
2 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00 0 Canceled
3 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00 0 Canceled
4 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50 0 Canceled

Checking the summary statistics of the dataset

Let's check the statistical summary of the data to better understand the data.

In [12]:
data.describe().T
Out[12]:
count mean std min 25% 50% 75% max
no_of_adults 36275.0 1.844962 0.518715 0.0 2.0 2.00 2.0 4.0
no_of_children 36275.0 0.105279 0.402648 0.0 0.0 0.00 0.0 10.0
no_of_weekend_nights 36275.0 0.810724 0.870644 0.0 0.0 1.00 2.0 7.0
no_of_week_nights 36275.0 2.204300 1.410905 0.0 1.0 2.00 3.0 17.0
required_car_parking_space 36275.0 0.030986 0.173281 0.0 0.0 0.00 0.0 1.0
lead_time 36275.0 85.232557 85.930817 0.0 17.0 57.00 126.0 443.0
arrival_year 36275.0 2017.820427 0.383836 2017.0 2018.0 2018.00 2018.0 2018.0
arrival_month 36275.0 7.423653 3.069894 1.0 5.0 8.00 10.0 12.0
arrival_date 36275.0 15.596995 8.740447 1.0 8.0 16.00 23.0 31.0
repeated_guest 36275.0 0.025637 0.158053 0.0 0.0 0.00 0.0 1.0
no_of_previous_cancellations 36275.0 0.023349 0.368331 0.0 0.0 0.00 0.0 13.0
no_of_previous_bookings_not_canceled 36275.0 0.153411 1.754171 0.0 0.0 0.00 0.0 58.0
avg_price_per_room 36275.0 103.423539 35.089424 0.0 80.3 99.45 120.0 540.0
no_of_special_requests 36275.0 0.619655 0.786236 0.0 0.0 0.00 1.0 5.0

The above data provide insights into the central tendency and spread of each of the variables. The average visitor party consists of roughly 2 adults and no kids for a total of 2 weekdays and one weekend. Very few require parking spaces. The average arrival is in mid July of 2018. Few repeated guests exist and few previous cancellations are evident, but an extremely high relative standard deviation exists in the cancellation variable and the previous bookings not cancelled variable. An average room costs 103 euros but a significant outlier of 540 exists in this attribute. Most guests make 1 or fewer special requests.

Exploratory Data Analysis

Univariate Analysis

Let's explore these variables in some more depth by observing their distributions.

We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.

In [13]:
# Defining the hist_box() function
def hist_box(data,col):
  f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12,6))
  # Adding a graph in each part
  sns.boxplot(data[col], ax=ax_box, showmeans=True)
  sns.distplot(data[col], ax=ax_hist)
  plt.show()

Plotting the histogram and box plot for the variable Lead Time using the hist_box function.

In [14]:
hist_box(data, "lead_time")

The lead time data are strongly skewed right with a great many outliers on the right tail.

Plotting the histogram and box plot for the variable Average Price per Room using the hist_box function.

In [15]:
hist_box(data, "avg_price_per_room")

The average price per room data have a more normal distribution. Outliers still exist in both the high and low tails, but the interquartile range is compact.

Interestingly, some rooms have a price equal to 0. Let's check them.

In [16]:
data[data["avg_price_per_room"] == 0]
Out[16]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
63 1 0 0 1 Meal Plan 1 0 Room_Type 1 2 2017 9 10 Complementary 0 0 0 0.0 1 Not_Canceled
145 1 0 0 2 Meal Plan 1 0 Room_Type 1 13 2018 6 1 Complementary 1 3 5 0.0 1 Not_Canceled
209 1 0 0 0 Meal Plan 1 0 Room_Type 1 4 2018 2 27 Complementary 0 0 0 0.0 1 Not_Canceled
266 1 0 0 2 Meal Plan 1 0 Room_Type 1 1 2017 8 12 Complementary 1 0 1 0.0 1 Not_Canceled
267 1 0 2 1 Meal Plan 1 0 Room_Type 1 4 2017 8 23 Complementary 0 0 0 0.0 1 Not_Canceled
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35983 1 0 0 1 Meal Plan 1 0 Room_Type 7 0 2018 6 7 Complementary 1 4 17 0.0 1 Not_Canceled
36080 1 0 1 1 Meal Plan 1 0 Room_Type 7 0 2018 3 21 Complementary 1 3 15 0.0 1 Not_Canceled
36114 1 0 0 1 Meal Plan 1 0 Room_Type 1 1 2018 3 2 Online 0 0 0 0.0 0 Not_Canceled
36217 2 0 2 1 Meal Plan 1 0 Room_Type 2 3 2017 8 9 Online 0 0 0 0.0 2 Not_Canceled
36250 1 0 0 2 Meal Plan 2 0 Room_Type 1 6 2017 12 10 Online 0 0 0 0.0 0 Not_Canceled

545 rows × 18 columns

  • There are quite a few hotel rooms which have a price equal to 0.
  • In the market segment column, it looks like many values are complementary.
In [17]:
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Out[17]:
Complementary    354
Online           191
Name: market_segment_type, dtype: int64
  • It makes sense that most values with room prices equal to 0 are the rooms given as complimentary service from the hotel.
  • The rooms booked online must be a part of some promotional campaign done by the hotel.
In [18]:
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(0.25)

# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(0.75)

# Calculating IQR
IQR = Q3 - Q1

# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
Out[18]:
179.55
In [19]:
# assigning the outliers the value of upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker

Let's understand the distribution of the categorical variables

Number of Children

In [20]:
sns.countplot(data['no_of_children'])
plt.show()
In [21]:
data['no_of_children'].value_counts(normalize=True)
Out[21]:
0     0.925624
1     0.044604
2     0.029166
3     0.000524
9     0.000055
10    0.000028
Name: no_of_children, dtype: float64
  • Customers were not travelling with children in 93% of cases.
  • There are some values in the data where the number of children is 9 or 10, which is highly unlikely.
  • We will replace these values with the maximum value of 3 children.
In [22]:
# replacing 9, and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)

Arrival Month

In [23]:
sns.countplot(data["arrival_month"])
plt.show()
In [24]:
data['arrival_month'].value_counts(normalize=True)
Out[24]:
10    0.146575
9     0.127112
8     0.105114
6     0.088298
12    0.083280
11    0.082150
7     0.080496
4     0.075424
5     0.071620
3     0.065003
2     0.046975
1     0.027953
Name: arrival_month, dtype: float64
  • October is the busiest month for hotel arrivals followed by September and August. Over 35% of all bookings, as we see in the above table, were for one of these three months.
  • Around 14.7% of the bookings were made for an October arrival.

Booking Status

In [25]:
sns.countplot(data["booking_status"])
plt.show()
In [26]:
data['booking_status'].value_counts(normalize=True)
Out[26]:
Not_Canceled    0.672364
Canceled        0.327636
Name: booking_status, dtype: float64
  • 32.8% of the bookings were canceled by the customers.

Let's encode Canceled bookings to 1 and Not_Canceled as 0 for further analysis

In [27]:
data["booking_status"] = data["booking_status"].apply(
    lambda x: 1 if x == "Canceled" else 0
)

Bivariate Analysis

Finding and visualizing the correlation matrix using a heatmap:

In [28]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(12, 7))
sns.heatmap(data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Several strong positive correlations exist within the data: average price per room and number of adults/children; lead time and booking status; number of previous bookings not cancelled and repeat guest status; and number of previous cancellations and previous non-cancellations. Strong negative correlations include arrival month and arrival year; number of special requests and booking status; and, to a lesser extent, number of adults and repeated guest status. These correlations, when analyzed independently, all seem to have a reason. As an example, more consecutive previous bookings not cancelled provides a greater likelihood that the guest will be a repeated guest because he/she has followed through on more travel plans in general, one or more of which potentially being at this hotel.

Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments:

In [29]:
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
)
plt.show()
  • Rooms booked online have high variations in prices.
  • The offline and corporate room prices are almost similar.
  • Complementary market segment gets the rooms at very low prices, which makes sense.

We will define a stacked barplot() function to help analyse how the target variable varies across predictor categories.

In [30]:
# Defining the stacked_barplot() function
def stacked_barplot(data,predictor,target,figsize=(10,6)):
  (pd.crosstab(data[predictor],data[target],normalize='index')*100).plot(kind='bar',figsize=figsize,stacked=True)
  plt.legend(loc="lower right")
  plt.ylabel('Percentage Cancellations %')

Plotting the stacked barplot for the variable Market Segment Type against the target variable Booking Status using the stacked_barplot function provided and write your insights:

In [31]:
stacked_barplot(data, "market_segment_type", "booking_status")

Among the given segment types, complimentary trips seldom if ever cancel, but targeting this segment is not straightforward as the company doesn't directly benefit financially from giving away trips. On the other hand, the online segment provides the greatest relative share of cancellations. The corporate segment shows the greatest potential as it pertains to risk aversion on the basis of cancellation.

Plotting the stacked barplot for the variable Repeated Guest against the target variable Booking Status using the stacked_barplot function provided and write your insights:

Important note: repeating guests are the guests who stay in the hotel often and are important to brand equity.

In [32]:
stacked_barplot(data, "repeated_guest", "booking_status")

Repeated guests obviously carry extreme importance both for brand equity and for reliability in avoiding trip cancellation, with an over 30% jump from non-repeats in cancellation probability.

Let's analyze the customer who stayed for at least a day at the hotel.

In [33]:
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)]
stay_data["total_days"] = (stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"])

stacked_barplot(stay_data, "total_days", "booking_status",figsize=(15,6))
  • The general trend is that the chances of cancellation increase as the number of days the customer planned to stay at the hotel increases.

As hotel room prices are dynamic, Let's see how the prices vary across different months

In [34]:
plt.figure(figsize=(10, 5))
sns.lineplot(y=data["avg_price_per_room"], x=data["arrival_month"], ci=None)
plt.show()
  • The price of rooms is highest in May to September - around 115 euros per room.

Data Preparation for Modeling

  • We want to predict which bookings will be canceled.
  • Before we proceed to build a model, we'll have to encode categorical features.
  • We'll split the data into train and test to be able to evaluate the model that we build on the train data.

Separating the independent variables (X) and the dependent variable (Y)

In [35]:
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]

X = pd.get_dummies(X, drop_first=True) # Encoding the Categorical features

Splitting the data into a 70% train and 30% test set

Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.

In [36]:
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,stratify=Y, random_state=1)
In [37]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set:
0    0.672377
1    0.327623
Name: booking_status, dtype: float64
Percentage of classes in test set:
0    0.672333
1    0.327667
Name: booking_status, dtype: float64

Model Evaluation Criterion

Model can make wrong predictions as:

  1. Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.
  2. Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking.

Which case is more important?

Both the cases are important as:

  • If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs of distribution channels.

  • If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage brand equity.

How to reduce the losses?

  • The hotel would want the F1 Score to be maximized, the greater the F1 score, the higher the chances of minimizing False Negatives and False Positives.

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.

In [38]:
# Creating metric function 
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    
    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Cancelled', 'Cancelled'], yticklabels=['Not Cancelled', 'Cancelled'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

Building the model

We will be building 4 different models:

  • Logistic Regression
  • Support Vector Machine (SVM)
  • Decision Tree
  • Random Forest

Logistic Regression

Building a Logistic Regression model using the sklearn library:

In [39]:
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train, y_train)
Out[39]:
LogisticRegression()

Checking the performance of the model on train and test data:

In [40]:
# Checking the performance on the training data
y_pred_train = lg.predict(X_train)
metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.83      0.89      0.86     17073
           1       0.74      0.61      0.67      8319

    accuracy                           0.80     25392
   macro avg       0.78      0.75      0.76     25392
weighted avg       0.80      0.80      0.80     25392

We have created a predictive model capable of predicting cancellation with a precision of 74% and a recall score of 61%.

Let's check the performance on the test set:

In [41]:
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.82      0.89      0.85      7317
           1       0.73      0.60      0.66      3566

    accuracy                           0.80     10883
   macro avg       0.77      0.75      0.76     10883
weighted avg       0.79      0.80      0.79     10883

The given model provides a relatively decent tradeoff between precision and recall (both > 60%) but using the precision-recall curve, we will be able to optimize these metrics.

Finding the optimal threshold for the model using the Precision-Recall Curve:

Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.

Let's use the Precision-Recall curve and see if we can find a better threshold.

In [42]:
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)

precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:,1])

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()

We want to choose a threshold that has a high recall while also having a small drop in precision. We also want to keep some precision. A threshhold value of 0.3 ensure a precision above 60% and a recall around 80%.

In [43]:
# Setting the optimal threshold
optimal_threshold = 0.3

Checking the performance of the model on train and test data using the optimal threshold:

In [44]:
# Creating confusion matrix
y_pred_train = lg.predict_proba(X_train)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.88      0.76      0.82     17073
           1       0.62      0.80      0.70      8319

    accuracy                           0.77     25392
   macro avg       0.75      0.78      0.76     25392
weighted avg       0.80      0.77      0.78     25392

The model performance has improved as compared to our initial model.The recall has increased by 20%.

In [45]:
y_pred_test = lg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold)
              precision    recall  f1-score   support

           0       0.88      0.76      0.81      7317
           1       0.61      0.78      0.69      3566

    accuracy                           0.77     10883
   macro avg       0.74      0.77      0.75     10883
weighted avg       0.79      0.77      0.77     10883

Using the model with a threshold of 0.3, the model has achieved a recall of 78% i.e. increase of 18%. The precision has dropped compared to the inital model but using the optimial threshold, the model's performance is more balanced.

Support Vector Machines

To accelerate SVM training, let's scale the data for support vector machines.

In [46]:
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)

Let's build the models using the two of the widely used kernel functions:

  1. Linear Kernel
  2. RBF Kernel

Building a Support Vector Machine model using a linear kernel

Note that we are using the scaled data for modeling the SVM.

In [47]:
svm = SVC(kernel='linear',probability=True) # Linear kernal or linear decision boundary
model = svm.fit(X = X_train_scaled, y = y_train)

Checking the performance of the model on train and test data

In [48]:
y_pred_train_svm = model.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
              precision    recall  f1-score   support

           0       0.83      0.90      0.86     17073
           1       0.74      0.61      0.67      8319

    accuracy                           0.80     25392
   macro avg       0.79      0.76      0.77     25392
weighted avg       0.80      0.80      0.80     25392

This model did a decent job predicting cancellations with a recall score of 90%. Predictions on non-cancellations were much lower (61%).

Checking model performance on test set

In [49]:
y_pred_test_svm = model.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
              precision    recall  f1-score   support

           0       0.82      0.90      0.86      7317
           1       0.74      0.61      0.67      3566

    accuracy                           0.80     10883
   macro avg       0.78      0.75      0.76     10883
weighted avg       0.80      0.80      0.80     10883

The performance from the training set seems to carry over to the test set, with similar benchmarks shown. We will now use the curve to optimize the precision-recall tradeoff.

Finding the optimal threshold for the model using the Precision-Recall Curve:

In [50]:
# Predict on train data
y_scores_svm=model.predict_proba(X_train_scaled)

precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()

A threshhold around 0.3 seems to maintain relatively high recall (0.8) while keeping precision above stable (0.6).

In [51]:
optimal_threshold_svm=0.3

Checking the performance of the model on train and test data using the optimal threshold.

In [52]:
y_pred_train_svm = model.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.88      0.77      0.82     17073
           1       0.63      0.79      0.70      8319

    accuracy                           0.78     25392
   macro avg       0.75      0.78      0.76     25392
weighted avg       0.80      0.78      0.78     25392

As suspected, the model presents a satisfactor recall improvement from 0.61 to 0.79 with a slight precision tradeoff of 0.63 down from 0.74.

In [53]:
y_pred_test = model.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.88      0.76      0.82      7317
           1       0.62      0.79      0.70      3566

    accuracy                           0.77     10883
   macro avg       0.75      0.78      0.76     10883
weighted avg       0.80      0.77      0.78     10883

  • SVM model with linear kernel is not overfitting as the accuracy is around 78% for both train and test dataset
  • The model has a Recall of 79% which is highest compared to the above moels.
  • At the optimal threshold of .30, the model's F1 score has improved marginally from 0.67 to 0.70.

Building a Support Vector Machines model using an RBF kernel:

In [54]:
svm_rbf=SVC(kernel='rbf',probability=True)
svm_rbf.fit(X_train_scaled,y_train)
Out[54]:
SVC(probability=True)

Checking the performance of the model on train and test data:

In [55]:
y_pred_train_svm = svm_rbf.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
              precision    recall  f1-score   support

           0       0.84      0.91      0.88     17073
           1       0.79      0.65      0.71      8319

    accuracy                           0.83     25392
   macro avg       0.81      0.78      0.80     25392
weighted avg       0.82      0.83      0.82     25392

When compared to the baseline svm model with linear kernel, the model's performance on training data has marginally improved using an RBF kernel from 0.70 to 0.71.

Checking model performance on test set:

In [56]:
y_pred_test = svm_rbf.predict(X_test_scaled)

metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.84      0.91      0.87      7317
           1       0.78      0.63      0.70      3566

    accuracy                           0.82     10883
   macro avg       0.81      0.77      0.78     10883
weighted avg       0.82      0.82      0.81     10883

When compared to the baseline svm model with linear kernel, the recall score on testing data has increased from 61% to 63%.

In [57]:
# Predict on train data
y_scores_svm=svm_rbf.predict_proba(X_train_scaled)

precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
In [58]:
optimal_threshold_svm=0.19

Checking the performance of the model on train and test data using the optimal threshold:

In [59]:
y_pred_train_svm = model.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.93      0.61      0.73     17073
           1       0.53      0.90      0.67      8319

    accuracy                           0.70     25392
   macro avg       0.73      0.76      0.70     25392
weighted avg       0.80      0.70      0.71     25392

  • SVM model with RBF kernel is performing better compared to the linear kernel.
  • The model has achieved a recall score of 0.78 but there is a slight drop in the precision value.
  • Using the model with a threshold of 0.17, the model gives a better recall score compared to the initial model.
In [60]:
y_pred_test = svm_rbf.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
              precision    recall  f1-score   support

           0       0.92      0.70      0.80      7317
           1       0.59      0.88      0.70      3566

    accuracy                           0.76     10883
   macro avg       0.75      0.79      0.75     10883
weighted avg       0.81      0.76      0.77     10883

  • The recall score for the model is around 88%.
  • At the optimal threshold of .17, the model performance has improved from 0.63 to 0.88.
  • This is arguably the best performing model when compared to SVM with linear kernel and Logistic Regression as it provides great recall with a relatively minor drop in precision.
  • Further study can be used to determine which levels of recall and precision are desirable to further delineate the quality of each model.

Decision Trees

Building a Decision Tree Model:

In [61]:
model_dt = DecisionTreeClassifier(random_state=1)
model_dt.fit(X_train, y_train)
Out[61]:
DecisionTreeClassifier(random_state=1)

Checking the performance of the model on train and test data:

In [62]:
# Checking performance on the training dataset:
pred_train_dt = model_dt.predict(X_train)
metrics_score(y_train, pred_train_dt)
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     17073
           1       1.00      0.99      0.99      8319

    accuracy                           0.99     25392
   macro avg       1.00      0.99      0.99     25392
weighted avg       0.99      0.99      0.99     25392

  • Almost 0 errors on the training set, each sample has been classified correctly.
  • Model has performed very well on the training set.
  • As we know a decision tree will continue to grow and classify each data point correctly if no restrictions are applied as the trees will learn all the patterns in the training set.
  • Let's check the performance on test data to see if the model is overfitting.

Checking model performance on test set

In [63]:
pred_test_dt = model_dt.predict(X_test)
metrics_score(y_test, pred_test_dt)
              precision    recall  f1-score   support

           0       0.90      0.90      0.90      7317
           1       0.79      0.79      0.79      3566

    accuracy                           0.87     10883
   macro avg       0.85      0.85      0.85     10883
weighted avg       0.87      0.87      0.87     10883

  • The decision tree model is clearly overfitting. However the decision tree has better performance compared to Logistic Regression and SVM models.
  • We will have to tune the decision tree to reduce the overfitting.

Hyperparameter tuning for the decision tree model using GridSearch CV:

In [64]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}


# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, cv=5,scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[64]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=50, min_samples_split=10,
                       random_state=1)

Checking the performance of the model on the train and test data using the tuned model:

Checking performance on the training set

In [65]:
# Checking performance on the training dataset
dt_tuned = estimator.predict(X_train)
metrics_score(y_train,dt_tuned)
              precision    recall  f1-score   support

           0       0.86      0.93      0.89     17073
           1       0.82      0.68      0.75      8319

    accuracy                           0.85     25392
   macro avg       0.84      0.81      0.82     25392
weighted avg       0.85      0.85      0.84     25392

The decision tree in the training set seems to predict fine, but other models previously built have had greater recall at the cost of only slightly reduced precision.

In [66]:
# Checking performance on the training dataset
y_pred_tuned = estimator.predict(X_test)
metrics_score(y_test,y_pred_tuned)
              precision    recall  f1-score   support

           0       0.85      0.93      0.89      7317
           1       0.82      0.67      0.74      3566

    accuracy                           0.84     10883
   macro avg       0.84      0.80      0.81     10883
weighted avg       0.84      0.84      0.84     10883

  • Decision tree model with default parameters is overfitting the training data and is not able to generalize well.
  • The tuned model has provided a generalised performance with balanced precision and recall values but not optimal ones. -Overall, model performance on test data has not significantly improved.

Visualizing the Decision Tree

In [67]:
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,max_depth=3,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()

Discerning important features based on the tuned decision tree:

In [69]:
# Importance of features in the tree building

importances = model_dt.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • We can see that the tree has become simpler and the rules of the trees are readable.
  • The model performance of the model has been generalized.
  • We observe that the most important features are:
    • Lead time
    • Average price per room
    • Arrival date

Random Forest

Building a Random Forest Model:

In [70]:
rf_estimator = RandomForestClassifier( random_state = 1)

rf_estimator.fit(X_train, y_train)
Out[70]:
RandomForestClassifier(random_state=1)

Checking the performance of the model on the train and test data:

In [71]:
y_pred_train_rf = rf_estimator.predict(X_train)

metrics_score(y_train, y_pred_train_rf)
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     17073
           1       1.00      0.99      0.99      8319

    accuracy                           0.99     25392
   macro avg       0.99      0.99      0.99     25392
weighted avg       0.99      0.99      0.99     25392

  • Almost 0 errors on the training set, each sample has been classified correctly.
  • Model has performed very well on the training set.
In [72]:
y_pred_test_rf = rf_estimator.predict(X_test)

metrics_score(y_test, y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.91      0.95      0.93      7317
           1       0.88      0.80      0.84      3566

    accuracy                           0.90     10883
   macro avg       0.90      0.88      0.88     10883
weighted avg       0.90      0.90      0.90     10883

  • The Random Forest classifier seems to be fitting well to the data.
  • The recall score is 0.80 which is slightly higher than other models.

Some important features based on the Random Forest:

In [73]:
importances = rf_estimator.feature_importances_

columns = X_train.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)


plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
sns.barplot(importance_df.Importance, importance_df.index,color="violet")
Out[73]:
<AxesSubplot:title={'center':'Feature Importances'}, xlabel='Importance'>
  • The Random Forest further verifies the results from the decision tree, that the most important features are lead time, price per room, and number of special requests.
  • Lead time is the most important feature. If the lead time of a guest is high, he/she is most likely to cancel his/her booking.
  • Price is also a key feature, probably as guests spending more money are most likely book other, cheaper hotels or find alternative accomodations.

Conclusions

  • We have found that corporate bookings have a significantly lower fraction of reservations cancelled. The hotel should thus try and prioritize this market segment to minimize cancellations without resorting to giving away complimentary trips.
  • We saw in our analysis that people with extreme lead times are significantly more likely to cancel their outtings. Imposing some restrictions for booking in advance could be lucrative, as would forcing a deposit or recommitment or providing better communication through reminders.
  • Special requests are often an opportunity to make or break a guest's reservation. Ensuring that anything possible is done to appease the guest during this crucial moment in the guest-hotel relationship can be the difference between a full room and an empty one.
  • Average price per room plays a significant part in coaxing guests to cancel reservations. Incentivizing higher spenders to follow through could provide immense benefit, whether through financial incentives, additional on-site perks, loyalty points, or other situations that provide greater return on investment for those willing to put more money down on initial reservations.