In this exercise, I build supervised machine learning models to predict employees’ attrition using IBM’s HR dataset. IBM employee data is a fictional dataset and is publicly availablehere.

Supervised machine learning is used when the correct outputs (prediction) are available. In this case the prediction is employees’ attrition, and during training the algorithms will search for patterns in the data (features) that correlate with the desired outputs (label). You can learn more about supervised machine learning at Microsoft Learn.

To predict which employees will leave the company (Yes attrition) or stay with the organization (No attrition), I created several binary (two-class) algorithms and chose the best model with the highest accuracy.

Algorithms: decision tree, random forest classifier, GLM, XGB, Two Class Support Vector Machine, Naive Bayes, and knn.

Preprocessing

First, install required packages, load libraries, and import the data, using the following code in R studio.

#Instal Packages and Libraries
install.packages("kknn", 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/kknn_1.3.1.tgz')
yinstall.packages("igraph", 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.6/igraph_1.2.5.tgz')
install.packages(xgboost, 'https://cran.rstudio.com/src/contrib/xgboost_1.0.0.2.tar.gz')
install.packages("kernlab", "DMwR", "caret", "pROC")
library(caTools)
library(rpart)
library(caret)
library(data.table)
library(dplyr)
library(DT)
library(gridExtra)
library(ggplot2)
library(caret)
library(Metrics)
library(randomForest)
library(pROC)
library(e1071)
library(DMwR)
#Import the dataset
df=read.csv('~/Desktop/Rfiles/IBMdata.csv')
#Attach the dataset to access variables localy in R
attach(df)
#explore the data structure
str(df)

The function str (last line of the code above) is useful for finding out about the structure of our dataset.

The result shows the IBM dataset is available as data frame with 1470 objects of 35 variables. Variable names, data types, number of levels in each factor, and some sample data are provided in the list.

This out put indicates that we need to perform some feature engineering before building our machine learning algorithms. For instance, some categorical variables are imported as integer which must be converted to factors. We can also remove some of the unnecessary variables from the data frame, and create some new features that may improve our prediction.

Feature Engineering

Coerce the integer variables into factors. Since these features are scaled variables which contain ordered categories, we define them as being ordered suing the following code.

The result shows that all seven variables are coerced into ordered factors.

Now let us create two new variables to record the average of employees’ tenure for their past jobs. Employees have worked for several companies and for specific period of time. We have the information about total working years, number of companies worked, and years at current company. Using these values, we can write a function that calculates average tenure of employees per each company.

Next, convert some of integers to category by binning the values using the variable summaries: Min, 1^{st} Qu, Median , Mean, 3^{rd} Qu, and Max.

Now, create a new data frame objects (df_cat) to save all categorical features which will be used for building the algorithms.

Finally, check to make sure that there are no missing values in the data set.

#create a new data frame with factors
df_cat = df[, c(2, 3, 5, 7, 8, 11, 12, 14, 15, 16, 17, 18, 23, 25, 26, 28, 31, 37, 38, 39, 40, 41)]
#check for missing values
colSums(is.na(df))

The result shows that the sum of missing values in each column is zero. Thus, the dataset is ready for building the models.

Split the Data

A machine learning algorithm is evaluated on how it performs in the real world with completely new datasets. However, in supervised methods, we have a dataset for which we know the outcomes, as we do with the attrition. Therefore, to mimic the ultimate evaluation process, we randomly split the data into training and test sets. The caret package includes the function that helps us split the data into 70% for training and 30% for testing.

#Split the Data
set.seed(010)
df_split <- sample.split(df_cat$Attrition, SplitRatio = 0.70)
df_cat_train <- subset(df_cat,df_split == T)
df_cat_test <- subset(df_cat,df_split == F)
# compare the dimention of splitted dataset
dim(df_cat)
dim(df_cat_test)
dim(df_cat_train)

> # Compare the dimensions

> dim(df_cat)

[1] 1470 33

> dim(df_test)

[1] 441 33

> dim(df_train)

[1] 1029 33

The dim() function above shows the proportion of the data in train and test sets. The result on the right shows that the test data consist of 441 rows (30%) and the train set contains 1029 rows (70%).

Build the Models

After splitting the data, we will begin by developing algorithms using only the training set and then evaluate them with the test set. To predict attrition, we must use classification approaches that help to predict each observation belongs to which occurring class. We begin with some tree-based methods such as decision tree and random forest and continue with support vector machine, XGB, KNN, and Naive Bayes.

Decision Tree

Decision tree involves segmenting the predictors into number of groups and set of rules will be used to predict the expected response. These segmentation rules can be summarized in a tree. For decision tree we will use the df_cat dataframe and build the algorithm and plot the decision tree using the following code.

#fit the model
DTModel <- rpart(formula = Attrition ~., data=df_cat_train)
#print the rules
rpart.rules(DTModel, style = "tall")
#plot the model and branches
prp(DTModel, type = 3, clip.right.labs = TRUE, box.palette="BlGnYl", branch = .5, under = FALSE)

The decision tree displays the leaves and at the end of each node, obtains a predictor for attrition and shows the probability of the fitted class. The leaf nodes are positioned at the bottom of the graph. In interpreting the results of a classification tree, we are interested in the class prediction corresponding to a particular terminal node region, and the class proportions among the training observations that fall into that region.

Validate models’ accuracy on test dataset, print the confusion matrix, and plot the ROC curve suing the following code.

#validation on test set
DTModel_pred = predict(DTModel, newdata = df_cat_test, type = 'class')
# confusion matrix
DTM_CM <- confusionMatrix(DTModel_pred, df_cat_test$Attrition)
DTM_CM
#plot the ROC curve
DTM_ROC <- plot.roc(as.numeric(df_cat_test$Attrition), as.numeric(DTModel_pred),lwd=4, type="b",grid.lty=3, grid=TRUE, print.auc=TRUE,print.auc.col= "#1B9E77", col ="#1B9E77")

The result of our validation shows the confusion matrix on the left side and the ROC curve on the right. A good test will have minimum numbers of false positive (FP) and false negative (FN). In current test FP is 50 and FN is 21. Other metrics that are important to notice from the confusion matrix are accuracy, sensitivity, specificity, and prevalence. This model predicted attrition with 85% accuracy. Sensitivity describes what proportion of employees are correctly identified as having no attrition and 0.96 shows that 96% of employees with attrition are identified correctly. Specificity describes the proportion of employees that are correctly identified as having attrition and if the value is low as it is in our model 0.29 it represents false positives. Finally, prevalence shows that how often the yes condition actually occurs in our sample which here is 0.83. The plot on the left is the ROC curve. It shows the point along the curve that gets us closest to the ideal model in the upper left corner has a threshold of 0.3 (the intersect of lowest specificity and highest sensitivity). The maximum value for the measure of the Area Under the Curve (AUC) is 1. The AUC of this model is 0.63 and therefore the model is not extremely good.

Random Forest

The goal of random forest algorithm is to improve prediction performance by averaging multiple decision trees and creating a forest of trees constructed randomly. It is done through bootstrap aggregation and generating many predictors using classification (or regression) trees. The final prediction is formed based on the average prediction of all these trees. The bootstrap is to ensure randomness and that the individual trees are not the same. The combination of trees is the forest and number of trees in current model are 50. This ensure that every input row gets predicted at least a few times. Use the following code to fit the model, predict, and plot the result.

#fit the random forest classification to the training set
RFModel = randomForest(x = df_cat_train[-1],y = df_cat_train$Attrition, ntree = 50, nodesize = 1, importance = TRUE)
#Plot the prediction
plot(RFModel)
RFModel
#predict on the test set
RF_pred = predict(RFModel,newdata = df_cat_test[-1],type="response")
#Build the confucion matrix
RFM_CM<-confusionMatrix(RF_pred, df_cat_test$Attrition)
RFM_CM
#Plot the ROC curve
RFModel_ROC <- plot.roc(as.numeric(df_cat_test$Attrition), as.numeric(RF_pred),lwd=4, type="b",grid.lty=3, grid=TRUE, print.auc=TRUE,print.auc.col= "#FF7F00", col ="#377EB8", main ="Tenure Based on Department")

Results from fitting the random forests for 50 trees in the above image (left plot) displays the test error as a function of the number of trees. Each colored line corresponds to a different value of the number of predictors available for splitting at each interior tree node. The green line shows the error rate of TN and FP around 0.78, the red line represent the error rates of false negatives FN and true positives TP as 0.02, and the black line shows the values of thresholds 0.2. The plot on the right is the result of validation on the test set. It shows the ROC curve, AUC 0.58, and the threshold at intersect of sensitivity 0.2 and specificity 1.0.

eXtreme Gradient Boosting

XGB is designed to push the model to the extreme to improve the accuracy of the prediction and overcome overfitting. This method uses more accurate approximations to find the best tree model. Here, we use cross validation function of XGB which randomly partitions the original sample into n fold equal size subsamples. The subsamples are used as training data. Use the following code to build the model.

#Set seeds
seeds <- set.seed(010)
#Define control parameters for train function
fitControl <- trainControl(method="cv", number = 10, classProbs = TRUE, seeds = seeds)
#Build the model and predict
XGBModel <- train(Attrition~., data = df_cat_train, method = "xgbTree",
trControl = fitControl)
XGBM_pred <- predict(XGBModel,df_cat_test)
XGBM_CM <- confusionMatrix(XGBM_pred, df_cat_test$Attrition)
#Plot
XGBModel_ROC <- plot.roc (as.numeric(df_cat_test$Attrition), as.numeric(XGBM_pred),lwd=3,
type="b",grid.lty=2, grid=TRUE, print.auc=TRUE,print.auc.col= "#1B9E77", col ="#1B9E77", main ="eXtreme Gradient Boosting")

Support Vector Machine

Support vector machine (SVM) is an approach for classification that have been shown to perform well in a variety of settings and are often considered one of the best out of the box classifiers. SVM is a further extension of the support vector classifier which accommodate non-linear class boundaries and is intended for the binary classification setting in which there are two classes. Below are the code to build the model, the confusion matrix, and ROC curve.

Now, we will fit a logistic regression method. The GLM function fits generalized linear models which is a class of models that includes logistic regression. In the syntax we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

#fitting the GLM model to the training set
GLMModel <- train(Attrition~.,df_cat_train,method = 'glm', family=binomial, trControl = trainControl(method = 'repeatedcv',number = 3))
#predicting on the test set
GLM_pred <- predict(GLMModel, df_cat_test)
#Making the confucion matrix
GLM_CM <- confusionMatrix(df_cat_test$Attrition, GLM_pred)
GLM_CM
#Plot the ROC curve
GLM_ROC <- plot.roc(as.numeric(df_cat_test$Attrition), as.numeric(GLM_pred),lwd=4, type="b",grid.lty=3, grid=TRUE, print.auc=TRUE,print.auc.col= "#386CB0", col ="#F0027F", main ="GLM Model")
GLM_ROC

Naive Bayes Classifier

Bayes classifier computes the conditional a-posterior probabilities of a categorical class variable given independent predictor variables using the Bayes rule. The standard naive Bayes classifier assumes independence of the predictor variables, and Gaussian distribution (given the target class) of predictors.

#Fit the model on train data
NBModel <- naiveBayes(Attrition~., data=df_cat_train)
#Validate on test set
NBM_pred = predict(NBModel, newdata = df_cat_test)
#Print confusion matrix
NBM_CM<-confusionMatrix(NBM_pred, df_cat_test$Attrition)
NBM_CM
#Plot
NBM_ROC <- plot.roc(as.numeric(df_cat_test$Attrition), as.numeric(NBM_pred),lwd=4, type="b",grid.lty=3, grid=TRUE, print.auc=TRUE,print.auc.col= "#FF7F00", col ="#FF7F00", main ="Naive Bayes")
NBM_ROC

K- Nearest Neighbor

The KNN classifier predicts the class of a given test observation by identifying the observations that are nearest to it. In k-nearest neighbor classification, for each row of the dataset, the k nearest vectors are found, and the classification is decided by majority vote, with ties broken at random. If there are ties for the k nearest vector, all candidates are included in the vote.

#Fit the model on train data
KnnModel <- train(Attrition~.,df_cat_train,method = 'knn',trControl = trainControl(method = 'repeatedcv',number = 3))
#Predict with test set
Knn_pred <- predict(KnnModel,df_cat_test)
#Print confusion matrix
Knn_CM <- confusionMatrix(df_cat_test$Attrition, Knn_pred)
#Plot the curve
plot.roc(as.numeric(df_cat_test$Attrition), as.numeric(Knn_pred),lwd=4, type="b",grid.lty=3, grid=TRUE, print.auc=TRUE,print.auc.col= "#BEBADA", col ="#BEBADA", main = "K-Nearest Neighbor")

Compare Models Performance

Now, it is time to compare all the statistics that was generated by our classification models. We can create a single plot that combines all the ROC curves and choose the best threshold value for our model. It is useful to also create a table from all statistics of confusion matrixes and compare them in detail. We start with creating a plot for all ROC curves using the following code.

The plot provides a general visualization to compare the curves that we have already seen in previous steps. The best model is the one with minimum specificity and maximum sensitivity. In our case, Naive Bayes and GLM have the highest sensitivity, but also have the highest specificity. It seems that the best threshold for our model will be 0.3. Other than that it does not provide any additional information, and choosing the best model solely based on the image is not easy and reliable. To get a better understanding we can extract the metrics from our confusion matrix and compare the values.

To access the statistics in confusion matrixes we simply can pick one of the matrixes and run the function str() to explore the measures and use the path in our code to access and print the values.

The following block of code helps to first extract desired metrics from confusion matrixes, create a data frame from the extracted values, and create a grid table to explore the data.

The table above contains important statistics for each model. Some of the variables are familiar as we have seen them in previous confusion matrixes like sensitivity, specificity, and accuracy. The new values driven from last block of code are:

Balanced accuracy is the balanced percentage of total items classified correctly (TP+TN)/(N+P). It is useful when the dataset is not balanced, and the numbers of trials in each class and cross-validation folds are not equal. The data exploration in previous post Exploring Employees’ Attrition indicates that the factor attrition contains 237 instances in the Yes class and 1233 observations in No class. This imbalance suggest that balanced accuracy is more appropriate to evaluate the performance of the models.

Recall and sensitivity are same metrics. They represent the number of items correctly identified as positive out of total true positives- TP/(TP+FN). Recall should be the model metric to consider when cost associated with false negative is high.

Precision is a measure that helps to determine if the costs of false positive is high. It calculates the number of items correctly identified as positive out of total items identified as positive- TP/(TP+FP)

F1 Score It is a harmonic mean of precision and recall and is a better measure for imbalanced classes in dataset.

So, according to table above the best scores per model are:

Specificity: DTM, RFM, XGBM = 0.296

Sensitivity: RFM = 0.978

Balanced Accuracy : SVM = 0.79

F1_Score : GLM = 0.927

Precision: SVM = 0.99

The following comparative bar plot will visualize these values clearly.

Results

Remember that we choose the model depending on the context. Some types of errors are more costly than others. We have to decide if it is much more important to maximize sensitivity over specificity, or vice versa. For this example, we simply prefer the model with the highest F1 score which is the harmonic average of precision and recall. Thus, the best model is GLM with F1 score of 0.92. It is important to notice that our models can be improved by balancing and scaling the dataset, including all features in the model or removing those that may not be pertinent to the prediction, dealing with the outliers, and checking for potential overfitting.