A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include a change of plans or scheduling conflicts, and cancelling is often made easier by the option to do so free of charge or at a low cost. This flexibility is beneficial to hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.
Online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
This pattern of booking cancellations impacts a hotel on various fronts, including lost revenue and the additional costs of distribution channels.
This increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal - they are facing problems with this high number of booking cancellations and have reached out to your firm for data-driven solutions. You, as a Data Scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below:
Data Dictionary
# Importing the basic libraries we will require for the project
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
# To get different metric scores
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, precision_recall_curve, roc_curve, make_scorer
# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')
hotel = pd.read_csv("INNHotelsGroup.csv")
# Copying data to another variable to avoid any changes to original data
data = hotel.copy()
Let's view the first few rows and last few rows of the dataset in order to understand its structure a little better. We will use the head() and tail() methods from Pandas to do this.
data.head()
data.tail()
data.shape
data.info()
Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status are of object type, while the remaining columns are numeric.
There are no null values in the dataset.
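Although info() already reports non-null counts, a direct check is quick:
# Counting missing values per column; every count should be 0
data.isnull().sum()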
# checking for duplicate values
data.duplicated().sum()
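If the count above were non-zero, exact duplicates could be dropped before modeling; a minimal sketch (a no-op when there are none):
# Dropping exact duplicate rows, if any exist
data = data.drop_duplicates()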
Let's drop the Booking_ID column first before we proceed forward, as a column with unique values will have almost no predictive power for the Machine Learning problem at hand.
data = data.drop(["Booking_ID"], axis=1)
data.head()
Let's check the statistical summary of the data to better understand the data.
data.describe().T
The above summary provides insight into the central tendency and spread of each variable. The average booking is for roughly 2 adults and no children, staying about 2 week nights and 1 weekend night. Very few guests require parking spaces. The average arrival is in mid-July of 2018. Repeated guests are rare and previous cancellations are few, but both the previous-cancellations and previous-bookings-not-canceled variables show an extremely high relative standard deviation. An average room costs 103 euros, though a significant outlier of 540 exists in this attribute. Most guests make 1 or fewer special requests.
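The "extremely high relative standard deviation" can be made precise with the coefficient of variation (standard deviation divided by mean); a quick sketch, assuming the two columns are named no_of_previous_cancellations and no_of_previous_bookings_not_canceled:
# Coefficient of variation for the two highly dispersed variables
for col in ["no_of_previous_cancellations", "no_of_previous_bookings_not_canceled"]:
    print(col, round(data[col].std() / data[col].mean(), 2))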
Let's explore these variables in some more depth by observing their distributions.
We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
# Defining the hist_box() function
def hist_box(data, col):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12, 6))
    # Adding a boxplot (top) and a histogram with KDE (bottom) for the same column
    sns.boxplot(x=data[col], ax=ax_box, showmeans=True)
    sns.histplot(data[col], ax=ax_hist, kde=True)  # histplot replaces the deprecated distplot
    plt.show()
Lead Time
Let's plot the distribution of lead_time using the hist_box function.
hist_box(data, "lead_time")
The lead time data are strongly skewed right with a great many outliers on the right tail.
Average Price per Room
Let's plot the distribution of avg_price_per_room using the hist_box function.
hist_box(data, "avg_price_per_room")
The average price per room data have a more normal distribution. Outliers still exist in both the high and low tails, but the interquartile range is compact.
Interestingly, some rooms have a price equal to 0. Let's check them.
data[data["avg_price_per_room"] == 0]
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
# Calculating the 25th percentile (Q1)
Q1 = data["avg_price_per_room"].quantile(0.25)
# Calculating the 75th percentile (Q3)
Q3 = data["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
# Capping the extreme outliers (>= 500) at the upper-whisker value
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
Number of Children
sns.countplot(x=data['no_of_children'])  # keyword argument required in recent seaborn versions
plt.show()
data['no_of_children'].value_counts(normalize=True)
# Replacing the rare values of 9 and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
Arrival Month
sns.countplot(data["arrival_month"])
plt.show()
data['arrival_month'].value_counts(normalize=True)
Booking Status
sns.countplot(data["booking_status"])
plt.show()
data['booking_status'].value_counts(normalize=True)
Let's encode Canceled bookings as 1 and Not_Canceled as 0 for further analysis.
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Several strong positive correlations exist within the data: average price per room with the number of adults and children; lead time with booking status; number of previous bookings not canceled with repeated-guest status; and number of previous cancellations with previous non-cancellations. Strong negative correlations include arrival month with arrival year; number of special requests with booking status; and, to a lesser extent, number of adults with repeated-guest status. Each of these relationships is plausible on its own. For example, more previous bookings not canceled make a guest more likely to be a repeated guest, since they have followed through on more travel plans in general, potentially including stays at this hotel.
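These pairwise readings can also be extracted programmatically; a sketch that ranks each numeric feature by the absolute strength of its correlation with the (now numerically encoded) target:
# Correlations with booking_status, ordered by absolute strength
corr_with_target = data[cols_list].corr()["booking_status"].drop("booking_status")
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))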
Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments:
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
)
plt.show()
We will define a stacked_barplot() function to help analyse how the target variable varies across predictor categories.
# Defining the stacked_barplot() function
def stacked_barplot(data, predictor, target, figsize=(10, 6)):
    (pd.crosstab(data[predictor], data[target], normalize='index') * 100).plot(kind='bar', figsize=figsize, stacked=True)
    plt.legend(loc="lower right")
    plt.ylabel('Percentage Cancellations %')
Market Segment Type
Let's plot market_segment_type against the target variable booking_status using the stacked_barplot function.
stacked_barplot(data, "market_segment_type", "booking_status")
Among the given segment types, complimentary trips are seldom if ever canceled, but targeting this segment is not straightforward, as the company doesn't directly benefit financially from giving away trips. At the other extreme, the online segment contributes the greatest relative share of cancellations. Among paying segments, corporate bookings carry the lowest cancellation risk and therefore the greatest potential.
Repeated Guest
Let's plot repeated_guest against the target variable booking_status using the stacked_barplot function. Important note: repeated guests are the guests who stay at the hotel often and are important to brand equity.
stacked_barplot(data, "repeated_guest", "booking_status")
Repeated guests clearly carry extreme importance, both for brand equity and for reliability: first-time guests cancel at a rate more than 30 percentage points higher than repeated guests.
Let's analyze the customers who stayed at least one week night and one weekend night at the hotel.
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)].copy()  # .copy() avoids a SettingWithCopyWarning
stay_data["total_days"] = stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
stacked_barplot(stay_data, "total_days", "booking_status",figsize=(15,6))
As hotel room prices are dynamic, let's see how the prices vary across different months.
plt.figure(figsize=(10, 5))
sns.lineplot(y=data["avg_price_per_room"], x=data["arrival_month"], errorbar=None)  # errorbar=None supersedes the deprecated ci=None
plt.show()
Separating the independent variables (X) and the dependent variable (Y)
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
X = pd.get_dummies(X, drop_first=True) # Encoding the Categorical features
Splitting the data into a 70% train and 30% test set
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,stratify=Y, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Both cases are important:
If we predict that a booking will not be canceled and it does get canceled, the hotel loses resources and has to bear additional distribution-channel costs.
If we predict that a booking will be canceled and it doesn't get canceled, the hotel might fail to provide satisfactory service to the customer on the assumption that the booking would be canceled. This might damage brand equity.
Therefore, we want the F1 Score to be maximized: the greater the F1 score, the better the chances of minimizing both False Negatives and False Positives. Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
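For reference, the F1 score is the harmonic mean of precision and recall, so it is large only when both are:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$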
# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    # fmt='d' renders the cell counts as integers
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=['Not Cancelled', 'Cancelled'], yticklabels=['Not Cancelled', 'Cancelled'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
We will be building 4 different models: Logistic Regression, Support Vector Machines (linear and RBF kernels), Decision Tree, and Random Forest.
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train, y_train)
# Checking the performance on the training data
y_pred_train = lg.predict(X_train)
metrics_score(y_train, y_pred_train)
We have created a predictive model capable of predicting cancellation with a precision of 74% and a recall score of 61%.
Let's check the performance on the test set:
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test)
metrics_score(y_test, y_pred_test)
The given model provides a relatively decent tradeoff between precision and recall (both > 60%) but using the precision-recall curve, we will be able to optimize these metrics.
Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
Let's use the Precision-Recall curve and see if we can find a better threshold.
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
We want to choose a threshold with high recall while limiting the drop in precision. A threshold value of 0.3 ensures a precision above 60% and a recall around 80%.
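Reading the threshold off the plot works, but we can also locate it programmatically; a minimal sketch that picks the threshold maximizing F1 on the training set, reusing the arrays computed above:
# F1 at each candidate threshold (the precision/recall arrays have one extra trailing element)
f1_scores = 2 * precisions_lg[:-1] * recalls_lg[:-1] / (precisions_lg[:-1] + recalls_lg[:-1] + 1e-12)
print("Threshold maximizing F1:", thresholds_lg[np.argmax(f1_scores)])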
# Setting the optimal threshold
optimal_threshold = 0.3
# Computing class probabilities to evaluate at the chosen threshold
y_pred_train = lg.predict_proba(X_train)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold)
The model performance has improved compared to our initial model: recall has increased by 20 percentage points.
y_pred_test = lg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold)
Using the model with a threshold of 0.3, we achieve a test recall of 78%, an increase of 18 percentage points. Precision has dropped compared to the initial model, but with the optimal threshold the model's performance is more balanced.
Support vector machines are sensitive to feature scale and train faster on scaled data, so let's scale the features first.
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)
Let's build the models using two of the most widely used kernel functions: linear and RBF.
Note that we are using the scaled data for modeling the SVM.
svm = SVC(kernel='linear', probability=True)  # Linear kernel, i.e., a linear decision boundary
model = svm.fit(X = X_train_scaled, y = y_train)
y_pred_train_svm = model.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
This model does a decent job overall, but the class-wise performance is uneven: recall on non-cancellations is high (90%), while recall on cancellations is much lower (61%).
Checking model performance on test set
y_pred_test_svm = model.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
The performance from the training set seems to carry over to the test set, with similar benchmarks shown. We will now use the curve to optimize the precision-recall tradeoff.
# Predict on train data
y_scores_svm=model.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
A threshold around 0.3 seems to maintain relatively high recall (0.8) while keeping precision stable above 0.6.
optimal_threshold_svm=0.3
y_pred_train_svm = model.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
As expected, the model shows a satisfactory recall improvement from 0.61 to 0.79, with a slight precision tradeoff, down from 0.74 to 0.63.
y_pred_test = model.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
svm_rbf=SVC(kernel='rbf',probability=True)
svm_rbf.fit(X_train_scaled,y_train)
y_pred_train_svm = svm_rbf.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
Compared to the baseline SVM with a linear kernel, performance on the training data has marginally improved with the RBF kernel, from 0.70 to 0.71.
y_pred_test = svm_rbf.predict(X_test_scaled)
metrics_score(y_test, y_pred_test)
Compared to the baseline SVM with a linear kernel, the recall on the test data has increased from 61% to 63%.
# Predict on train data
y_scores_svm=svm_rbf.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
optimal_threshold_svm = 0.19
# Using the RBF-kernel model (not the linear one) to evaluate at the chosen threshold
y_pred_train_svm = svm_rbf.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1] > optimal_threshold_svm)
y_pred_test = svm_rbf.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
model_dt = DecisionTreeClassifier(random_state=1)
model_dt.fit(X_train, y_train)
# Checking performance on the training dataset:
pred_train_dt = model_dt.predict(X_train)
metrics_score(y_train, pred_train_dt)
pred_test_dt = model_dt.predict(X_test)
metrics_score(y_test, pred_test_dt)
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, cv=5,scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
# Checking performance on the training dataset
dt_tuned = estimator.predict(X_train)
metrics_score(y_train,dt_tuned)
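It is also worth recording which hyperparameter combination the search selected:
# Best hyperparameters found by the grid search
print(grid_obj.best_params_)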
On the training set, the tuned decision tree performs reasonably well, but the models built previously achieved greater recall at the cost of only slightly reduced precision.
# Checking performance on the test dataset
y_pred_tuned = estimator.predict(X_test)
metrics_score(y_test,y_pred_tuned)
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,max_depth=3,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
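As a complement to the plot, scikit-learn can also render the fitted rules as plain text; a minimal sketch using export_text, truncated to the top three levels:
from sklearn.tree import export_text
# Text rendering of the tuned tree's first three levels
print(export_text(estimator, feature_names=feature_names, max_depth=3))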
# Importance of features in the tuned tree (the same model visualized above)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train, y_train)
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train_rf)
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
importances = rf_estimator.feature_importances_
columns = X_train.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
sns.barplot(x=importance_df.Importance, y=importance_df.index, color="violet")  # keyword arguments required in recent seaborn versions
plt.show()