A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include a change of plans or scheduling conflicts, and cancelling is often made easier by the option to do so free of charge or at a low cost. This flexibility is beneficial to hotel guests, but it is a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.
Online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
This pattern of booking cancellations impacts a hotel on various fronts, including lost revenue and the additional costs of distribution channels.
This increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which booking is likely to be canceled. INN Hotels Group has a chain of hotels in Portugal - they are facing problems with this high number of booking cancellations and have reached out to your firm for data-driven solutions. You, as a Data Scientist, have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below:
Data Dictionary
# Importing the basic libraries we will require for the project
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
# To get different metric scores
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, precision_recall_curve, roc_curve, make_scorer
# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')
hotel = pd.read_csv("INNHotelsGroup.csv")
# Copying data to another variable to avoid any changes to original data
data = hotel.copy()
Let's view the first few rows and last few rows of the dataset in order to understand its structure a little better. We will use the head() and tail() methods from Pandas to do this.
data.head()
data.tail()
data.shape
data.info()
Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status are of object type, while the remaining columns are numeric.
There are no null values in the dataset.
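Although info() already reports non-null counts, a direct check is quick:
# Counting missing values per column; every count should be 0
data.isnull().sum()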
# checking for duplicate values
data.duplicated().sum()
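If the count above were non-zero, exact duplicates could be dropped before modeling; a minimal sketch (a no-op when there are none):
# Dropping exact duplicate rows, if any exist
data = data.drop_duplicates()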
Let's drop the Booking_ID column first before we proceed forward, as a column with unique values will have almost no predictive power for the Machine Learning problem at hand.
data = data.drop(["Booking_ID"], axis=1)
data.head()
Let's check the statistical summary of the data to better understand the data.
data.describe().T
The above summary provides insight into the central tendency and spread of each variable. The average booking is for roughly 2 adults and no children, staying about 2 week nights and 1 weekend night. Very few guests require parking spaces. The average arrival is in mid-July of 2018. Repeated guests are rare and previous cancellations are few, but both the previous-cancellations and previous-bookings-not-canceled variables show an extremely high relative standard deviation. An average room costs 103 euros, though a significant outlier of 540 exists in this attribute. Most guests make 1 or fewer special requests.
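The "extremely high relative standard deviation" can be made precise with the coefficient of variation (standard deviation divided by mean); a quick sketch, assuming the two columns are named no_of_previous_cancellations and no_of_previous_bookings_not_canceled:
# Coefficient of variation for the two highly dispersed variables
for col in ["no_of_previous_cancellations", "no_of_previous_bookings_not_canceled"]:
    print(col, round(data[col].std() / data[col].mean(), 2))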
Let's explore these variables in some more depth by observing their distributions.
We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.
# Defining the hist_box() function
def hist_box(data, col):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12, 6))
    # Adding a boxplot (top) and a histogram with KDE (bottom) for the same column
    sns.boxplot(x=data[col], ax=ax_box, showmeans=True)
    sns.histplot(data[col], ax=ax_hist, kde=True)  # histplot replaces the deprecated distplot
    plt.show()
Lead Time
Let's plot the distribution of lead_time using the hist_box function.
hist_box(data, "lead_time")
The lead time data are strongly skewed right with a great many outliers on the right tail.
Average Price per Room
Let's plot the distribution of avg_price_per_room using the hist_box function.
hist_box(data, "avg_price_per_room")
The average price per room data have a more normal distribution. Outliers still exist in both the high and low tails, but the interquartile range is compact.
Interestingly, some rooms have a price equal to 0. Let's check them.
data[data["avg_price_per_room"] == 0]
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
# Calculating the 25th percentile (Q1)
Q1 = data["avg_price_per_room"].quantile(0.25)
# Calculating the 75th percentile (Q3)
Q3 = data["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
# Capping the extreme outliers (>= 500) at the upper-whisker value
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
Number of Children
sns.countplot(x=data['no_of_children'])  # keyword argument required in recent seaborn versions
plt.show()
data['no_of_children'].value_counts(normalize=True)
# Replacing the rare values of 9 and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
Arrival Month
sns.countplot(data["arrival_month"])
plt.show()
data['arrival_month'].value_counts(normalize=True)
Booking Status
sns.countplot(data["booking_status"])
plt.show()
data['booking_status'].value_counts(normalize=True)
Let's encode Canceled bookings as 1 and Not_Canceled as 0 for further analysis.
data["booking_status"] = data["booking_status"].apply(
lambda x: 1 if x == "Canceled" else 0
)
cols_list = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(12, 7))
sns.heatmap(data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Several strong positive correlations exist within the data: average price per room with the number of adults and children; lead time with booking status; number of previous bookings not canceled with repeated-guest status; and number of previous cancellations with previous non-cancellations. Strong negative correlations include arrival month with arrival year; number of special requests with booking status; and, to a lesser extent, number of adults with repeated-guest status. Each of these relationships is plausible on its own. For example, more previous bookings not canceled make a guest more likely to be a repeated guest, since they have followed through on more travel plans in general, potentially including stays at this hotel.
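These pairwise readings can also be extracted programmatically; a sketch that ranks each numeric feature by the absolute strength of its correlation with the (now numerically encoded) target:
# Correlations with booking_status, ordered by absolute strength
corr_with_target = data[cols_list].corr()["booking_status"].drop("booking_status")
print(corr_with_target.reindex(corr_with_target.abs().sort_values(ascending=False).index))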
Hotel rates are dynamic and change according to demand and customer demographics. Let's see how prices vary across different market segments:
plt.figure(figsize=(10, 6))
sns.boxplot(
data=data, x="market_segment_type", y="avg_price_per_room", palette="gist_rainbow"
)
plt.show()
We will define a stacked_barplot() function to help analyse how the target variable varies across predictor categories.
# Defining the stacked_barplot() function
def stacked_barplot(data, predictor, target, figsize=(10, 6)):
    (pd.crosstab(data[predictor], data[target], normalize='index') * 100).plot(kind='bar', figsize=figsize, stacked=True)
    plt.legend(loc="lower right")
    plt.ylabel('Percentage Cancellations %')
Market Segment Type
Let's plot market_segment_type against the target variable booking_status using the stacked_barplot function.
stacked_barplot(data, "market_segment_type", "booking_status")
Among the given segment types, complimentary trips are seldom if ever canceled, but targeting this segment is not straightforward, as the company doesn't directly benefit financially from giving away trips. At the other extreme, the online segment contributes the greatest relative share of cancellations. Among paying segments, corporate bookings carry the lowest cancellation risk and therefore the greatest potential.
Repeated Guest
Let's plot repeated_guest against the target variable booking_status using the stacked_barplot function. Important note: repeated guests are the guests who stay at the hotel often and are important to brand equity.
stacked_barplot(data, "repeated_guest", "booking_status")
Repeated guests clearly carry extreme importance, both for brand equity and for reliability: first-time guests cancel at a rate more than 30 percentage points higher than repeated guests.
Let's analyze the customers who stayed at least one week night and one weekend night at the hotel.
stay_data = data[(data["no_of_week_nights"] > 0) & (data["no_of_weekend_nights"] > 0)].copy()  # .copy() avoids a SettingWithCopyWarning
stay_data["total_days"] = stay_data["no_of_week_nights"] + stay_data["no_of_weekend_nights"]
stacked_barplot(stay_data, "total_days", "booking_status",figsize=(15,6))
As hotel room prices are dynamic, let's see how the prices vary across different months.
plt.figure(figsize=(10, 5))
sns.lineplot(y=data["avg_price_per_room"], x=data["arrival_month"], errorbar=None)  # errorbar=None supersedes the deprecated ci=None
plt.show()
Separating the independent variables (X) and the dependent variable (Y)
X = data.drop(["booking_status"], axis=1)
Y = data["booking_status"]
X = pd.get_dummies(X, drop_first=True) # Encoding the Categorical features
Splitting the data into a 70% train and 30% test set
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use the stratified sampling technique to ensure that relative class frequencies are approximately preserved in each train and validation fold.
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30,stratify=Y, random_state=1)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Both cases are important:
If we predict that a booking will not be canceled and it does get canceled, the hotel loses resources and has to bear additional distribution-channel costs.
If we predict that a booking will be canceled and it doesn't get canceled, the hotel might fail to provide satisfactory service to the customer on the assumption that the booking would be canceled. This might damage brand equity.
Therefore, we want the F1 Score to be maximized: the greater the F1 score, the better the chances of minimizing both False Negatives and False Positives. Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
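For reference, the F1 score is the harmonic mean of precision and recall, so it is large only when both are:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$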
# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    # fmt='d' renders the cell counts as integers
    sns.heatmap(cm, annot=True, fmt='d', xticklabels=['Not Cancelled', 'Cancelled'], yticklabels=['Not Cancelled', 'Cancelled'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
We will be building 4 different models: Logistic Regression, Support Vector Machines (linear and RBF kernels), Decision Tree, and Random Forest.
# Fitting logistic regression model
lg = LogisticRegression()
lg.fit(X_train, y_train)
# Checking the performance on the training data
y_pred_train = lg.predict(X_train)
metrics_score(y_train, y_pred_train)
We have created a predictive model capable of predicting cancellation with a precision of 74% and a recall score of 61%.
Let's check the performance on the test set:
# Checking the performance on the test dataset
y_pred_test = lg.predict(X_test)
metrics_score(y_test, y_pred_test)
The given model provides a relatively decent tradeoff between precision and recall (both > 60%) but using the precision-recall curve, we will be able to optimize these metrics.
Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
Let's use the Precision-Recall curve and see if we can find a better threshold.
# Predict_proba gives the probability of each observation belonging to each class
y_scores_lg=lg.predict_proba(X_train)
precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label='precision')
plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
We want to choose a threshold with high recall while limiting the drop in precision. A threshold value of 0.3 ensures a precision above 60% and a recall around 80%.
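Reading the threshold off the plot works, but we can also locate it programmatically; a minimal sketch that picks the threshold maximizing F1 on the training set, reusing the arrays computed above:
# F1 at each candidate threshold (the precision/recall arrays have one extra trailing element)
f1_scores = 2 * precisions_lg[:-1] * recalls_lg[:-1] / (precisions_lg[:-1] + recalls_lg[:-1] + 1e-12)
print("Threshold maximizing F1:", thresholds_lg[np.argmax(f1_scores)])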
# Setting the optimal threshold
optimal_threshold = 0.3
# Computing class probabilities to evaluate at the chosen threshold
y_pred_train = lg.predict_proba(X_train)
metrics_score(y_train, y_pred_train[:,1]>optimal_threshold)
The model performance has improved compared to our initial model: recall has increased by 20 percentage points.
y_pred_test = lg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold)
Using the model with a threshold of 0.3, we achieve a test recall of 78%, an increase of 18 percentage points. Precision has dropped compared to the initial model, but with the optimal threshold the model's performance is more balanced.
Support vector machines are sensitive to feature scale and train faster on scaled data, so let's scale the features first.
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train_scaled = scaling.transform(X_train)
X_test_scaled = scaling.transform(X_test)
Let's build the models using two of the most widely used kernel functions: linear and RBF.
Note that we are using the scaled data for modeling the SVM.
svm = SVC(kernel='linear', probability=True)  # Linear kernel, i.e., a linear decision boundary
model = svm.fit(X = X_train_scaled, y = y_train)
y_pred_train_svm = model.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
This model does a decent job overall, but the class-wise performance is uneven: recall on non-cancellations is high (90%), while recall on cancellations is much lower (61%).
Checking model performance on test set
y_pred_test_svm = model.predict(X_test_scaled)
metrics_score(y_test, y_pred_test_svm)
The performance from the training set seems to carry over to the test set, with similar benchmarks shown. We will now use the curve to optimize the precision-recall tradeoff.
# Predict on train data
y_scores_svm=model.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
A threshold around 0.3 seems to maintain relatively high recall (0.8) while keeping precision stable above 0.6.
optimal_threshold_svm=0.3
y_pred_train_svm = model.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1]>optimal_threshold_svm)
As expected, the model shows a satisfactory recall improvement from 0.61 to 0.79, with a slight precision tradeoff, down from 0.74 to 0.63.
y_pred_test = model.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
svm_rbf=SVC(kernel='rbf',probability=True)
svm_rbf.fit(X_train_scaled,y_train)
y_pred_train_svm = svm_rbf.predict(X_train_scaled)
metrics_score(y_train, y_pred_train_svm)
Compared to the baseline SVM with a linear kernel, performance on the training data has marginally improved with the RBF kernel, from 0.70 to 0.71.
y_pred_test = svm_rbf.predict(X_test_scaled)
metrics_score(y_test, y_pred_test)
Compared to the baseline SVM with a linear kernel, the recall on the test data has increased from 61% to 63%.
# Predict on train data
y_scores_svm=svm_rbf.predict_proba(X_train_scaled)
precisions_svm, recalls_svm, thresholds_svm = precision_recall_curve(y_train, y_scores_svm[:,1])
# Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds_svm, precisions_svm[:-1], 'b--', label='precision')
plt.plot(thresholds_svm, recalls_svm[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
optimal_threshold_svm = 0.19
# Using the RBF-kernel model (not the linear one) to evaluate at the chosen threshold
y_pred_train_svm = svm_rbf.predict_proba(X_train_scaled)
metrics_score(y_train, y_pred_train_svm[:,1] > optimal_threshold_svm)
y_pred_test = svm_rbf.predict_proba(X_test_scaled)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold_svm)
model_dt = DecisionTreeClassifier(random_state=1)
model_dt.fit(X_train, y_train)
# Checking performance on the training dataset:
pred_train_dt = model_dt.predict(X_train)
metrics_score(y_train, pred_train_dt)
pred_test_dt = model_dt.predict(X_test)
metrics_score(y_test, pred_test_dt)
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
}
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, cv=5,scoring='recall',n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
# Checking performance on the training dataset
dt_tuned = estimator.predict(X_train)
metrics_score(y_train,dt_tuned)
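It is also worth recording which hyperparameter combination the search selected:
# Best hyperparameters found by the grid search
print(grid_obj.best_params_)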
On the training set, the tuned decision tree performs reasonably well, but the models built previously achieved greater recall at the cost of only slightly reduced precision.
# Checking performance on the test dataset
y_pred_tuned = estimator.predict(X_test)
metrics_score(y_test,y_pred_tuned)
feature_names = list(X_train.columns)
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
estimator,max_depth=3,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
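As a complement to the plot, scikit-learn can also render the fitted rules as plain text; a minimal sketch using export_text, truncated to the top three levels:
from sklearn.tree import export_text
# Text rendering of the tuned tree's first three levels
print(export_text(estimator, feature_names=feature_names, max_depth=3))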
# Importance of features in the tuned tree (the same model visualized above)
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
rf_estimator = RandomForestClassifier(random_state=1)
rf_estimator.fit(X_train, y_train)
y_pred_train_rf = rf_estimator.predict(X_train)
metrics_score(y_train, y_pred_train_rf)
y_pred_test_rf = rf_estimator.predict(X_test)
metrics_score(y_test, y_pred_test_rf)
importances = rf_estimator.feature_importances_
columns = X_train.columns
importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
sns.barplot(x=importance_df.Importance, y=importance_df.index, color="violet")  # keyword arguments required in recent seaborn versions
plt.show()