Exploring Employees’ Attrition (R)

Employee attrition is one of the challenges in human resource management. It refers to voluntary or involuntary loss of human capital through processes such as elimination of position, retirement, termination, resignation, employees’ health or personal problems and more. Although attrition is considered more friendly, but it has almost same adverse influence on organization as turnover does.

Attrition can cost 33% of an employee’s total compensation and also negatively impact employees moral and effectiveness. High staff turnover and attrition rate can risk can also damage organizations’ reputation and hence limit their access to labor market.  In addition, predicting employees’ attrition is necessary for strategic human resource planning and forecasting and preparing for possible talent scarcity.

Factors that may influence attrition are management, work-life conflict, nature of job (involving, engaging, meaningful,),  reward systems, development opportunities, skill match, trust, compensation and benefit, and more. While this metrics may from organization to organization, employees’ data, data science, analytical tools, and machine learning technology make it possible to predict employee attrition.

In this example, I explore IBM’s HR dataset to find patterns and common characteristics in different variables that may point to attrition.

IBM employee data is a fictional dataset and is publicly available and you can download the cvs file from  here.

Download the complete R script  from  my Github repository.

Data Exploration

First, install the required packages and libraries, and  import the IBM dataset to R studio using the following code. Then use the str() code to  see the structure of the data. 

 

install.packages("ggpubr")
install.packages("wesanderson")
library(wesanderson)
library(dslabs)
library(tidyverse)    
library(ggplot2)
library(dplyr)
library(GGally)
library(rpart)
library(rpart.plot)
library(ggpubr)
# Imort the Data
df=read.csv('~/Desktop/Rfiles/IBMdata.csv')
#  attach the dataset to access variables localy in R 
attach(df)
#explore the data structure
str(df) 

The result shows the IBM dataset is available as data frame with 1470 objects of 35 variables. Variable names, data types, number of levels in each factor, and some sample data are provided in the result.

The only problem here is that some of the variables that should be factors are imported as integers. Using the following code we can convert these variables to ordered factors. 

#convert int variables to ordered factors 
names <- c('RelationshipSatisfaction', 'PerformanceRating', 'WorkLifeBalance', 
           'JobInvolvement', 'JobSatisfaction', 'JobLevel')
df[,names] <- lapply(df[,names] , factor, ordered = TRUE)
str(df) 

Next, check whether there are any missing values in the dataset using the following code.

# explore missing values in each columns
sapply(df, function(x) sum(is.na (x))) 

As we can see there are no missing values in this dataset.   

One of the useful outputs of data visualization is that we can learn about the distribution of variables. For categorical data we can construct this distribution by simply computing the frequency of each unique value. This can be done with the function table using. For instance, to see the distribution of each levels in marital status  the following code will return that 327 of employees are divorced, 673 married, and 470 are single. 

#compute the frequencies of each unique values in Marital status column and save it into object maritaltab.

maritaltab <- table(df$MaritalStatus)
maritaltab 

 

>  maritaltab

    Divorced     Married         Single

            327             673             470 

 

Now let use the following code to create a table that shows sum of attrition for all employees.


#Create a bar plot for Attrition
df %>%
  group_by(Attrition) %>%
  tally() %>%
  ggplot(aes(x = Attrition, 
    y = n,fill=Attrition)) +
  geom_bar(stat = "identity") +
  theme_minimal()+
  labs(x="Attrition",  
    y="Count of Attriation")+
  ggtitle("Attrition")+
  geom_text(aes(label = n), 
    vjust = -0.5, 
    position = position_dodge(0.9))
    
     

The bar plot shows that from 1407 employees 1233 show attrition and only 237 of employee stay in organization. 

We can also check the distribution of a variable and its standard deviation with density plot. Moreover, we can compare the distribution of multiple variables in one density plot. For instance, we can  compare the dispersion of the attrition for each level of job involvement and based on years employees worked with their current manager, using the following code. 

# Define Involvement labels for each level 
df <- df %>%
  mutate(group = case_when(
    .$JobInvolvement == 1 ~ "Barely Involved",
    .$JobInvolvement == 2 ~ "Slightly Involved",
    .$JobInvolvement == 3 ~ "Involved",
    .$JobInvolvement == 4 ~ "Very Involved"))

# define years with current manager as variable for ploting
ywmngr_atr <- df %>%
  ggplot(aes(YearsWithCurrManager, fill = group)) +
  scale_x_continuous(trans = "log10") +
  ggtitle("Attrition Based on Year with Current Manager and Involvement") + theme_bw() + 
  theme(plot.title = element_text(hjust=0.5, face = "bold", colour = "#386CB0", size = 16),
        axis.text = element_text(face = "bold", size = 10),
        axis.title.x = element_text(face = "bold", size = 12),
        axis.title.y = element_text(face = "bold", size = 12))
# stacked density plot
ywmngr_atr + geom_density(alpha = 0.3, bw = 0.75, position = "stack") +
  facet_grid(Attrition ~ .)  

The plot shows the dispersion of attrition for both levels (Yes and No) around mean in two stacked plots. In each plot we can see the distribution lines for all four levels of employees job involvement relevant to the years employees spent with their current manager. Plots show that employees with attrition that are slightly and highly involved are dispersed in 1 standard deviation where they spent between 3 to 4 years with their current managers. And as the number of years goes higher the standard deviations show a subtle decrease. 

Let try another density plot to compare attrition based on monthly income and job satisfaction, using the following code. 

# Define Jobsatisfaction labels for each level 
df <- df %>%
  mutate(Job_Satisfaction = case_when(
    .$JobSatisfaction == 1 ~ "Not Satisfied",
    .$JobSatisfaction == 2 ~ "Slightly Satisfied",
    .$JobSatisfaction == 3 ~ "Satisfied",
    .$JobSatisfaction == 4 ~ "Very Satisfied"))
# define the variables for ploting
incm_atr <- df %>%
  ggplot(aes(MonthlyIncome, fill = Job_Satisfaction)) +
  scale_x_continuous(trans = "log2") + 
  scale_fill_manual(values=c("#D53E4F", "#ABDDA4", "#2166AC", "#FFFF33")) +
  ggtitle("Attrition Based on Monthly Income and Job Satisfaction") + theme_bw() + 
  theme(plot.title = element_text(hjust=0.5, face = "bold", colour = "#386CB0", size = 16),
        axis.text = element_text(face = "bold", size = 10),
        axis.title.x = element_text(face = "bold", size = 12),
        axis.title.y = element_text(face = "bold", size = 12)) +
  geom_vline(aes(xintercept= mean(MonthlyIncome)),
           linetype="dashed", color = "blue")
# stack density plots
incm_atr + geom_density(alpha = 0.3, bw = 0.75, position = "stack") +
  facet_grid(Attrition ~ .)  

Plots show that employees with attrition that are satisfied are dispersed between  1 to .65 standard deviation where their monthly income is around 3000. At the same range of income employees with slight job satisfaction are dense in areas between .65 to .3 of standard deviation. And the probability of attrition for very satisfied employees is significant at .3 and less. As the monthly income increases the probability of attrition goes higher for all groups.  This result is slightly different for employees with no attrition as the pick of income for 1 standard deviation start from around $5000 and the probability of showing no attrition is less significant as the income increases.

Now we want to see how attrition shows itself in each department and if business travel and distance from home have any effects on attrition. 

Use the following code to create the bar plots.

 

travel_atr <- ggplot(df, aes(BusinessTravel,fill=Attrition))+geom_bar()+scale_fill_manual(values=c("#2166AC", "#B3CDE3" ))
dep_atr <- ggplot(df, aes(Department,fill = Attrition))+geom_bar()+scale_fill_manual(values=c("#2166AC", "#B3CDE3" ))
dist_atr <- ggplot(df,aes(DistanceFromHome,fill=Attrition))+geom_bar()+scale_fill_manual(values=c("#2166AC", "#B3CDE3" ))

ggarrange(travel_atr, dep_atr, dist_atr + rremove("x.text"),
          labels = c("Business Travel", "Department", "Distance from Home"),
          ncol = 3, nrow = 1)
 

The plot shows that those employee that rarely travel have higher number of attrition. Attrition is also higher in research and development department and for those with minimum distance from home. 

Now we will try same bar plot to check the relationship of attrition with gender, education level, field of study, and environmental satisfaction using the following code.

eduPlot <- ggplot(df,aes(Education,fill=Attrition))+geom_bar()+scale_fill_manual(values=c("#BEAED4", "#FFFF99"))+theme_bw(base_family = "Times")
edufieldPlot <- ggplot(df,aes(EducationField,fill=Attrition))+geom_bar()+scale_fill_manual(values=c("#BEAED4", "#FFFF99"))+theme_bw(base_family = "Times")
envPlot <- ggplot(df,aes(EnvironmentSatisfaction,fill=Attrition))+geom_bar()+scale_fill_manual(values=c("#BEAED4", "#FFFF99"))+theme_bw(base_family = "Times")
genPlot <- ggplot(df,aes(Gender,fill=Attrition))+geom_bar()+scale_fill_manual(values=c("#BEAED4", "#FFFF99"))+theme_bw(base_family = "Times")
ggarrange(eduPlot,edufieldPlot,envPlot,genPlot + rremove("x.text"), 
          ncol = 2, nrow = 2) 

Use the following code to build two scatter plots to see the correlation between age and tenure for both male and female, and the relationship between monthly rate and year at company for employees with different marital status. 

#explore the relationship between age and years at company while comparing Gender
tnr_age <- ggplot(df, aes(Age, YearsAtCompany, shape = Gender, colour = Gender)) + geom_point(size=3) + theme_test() + 
  ggtitle("Age vs Tenure for Male and Female") + theme(plot.title = element_text(hjust=0.5, face = "bold", colour = "#386CB0", size = 16)) 
print(tnr_age + scale_shape_manual(values = c(18, 21)) + 
        scale_colour_manual(values = c("#7570B3", "#1B9E77")))
#explore the relationship between monthly rate and year at company while comparing maritalstatus
mrate_tnr <- ggplot(df, aes(MonthlyRate, YearsAtCompany, color = MaritalStatus)) + 
  scale_x_continuous(trans = "log10") + 
  geom_point(size = 2) + theme_test() + 
  ggtitle("Monthly Rate vs Tenure for Marital Status") + theme(plot.title = element_text(hjust=0.5, face = "bold", colour = "#386CB0", size = 16)) 
mrate_tnr
#arragne plots side by side
ggarrange(tnr_age, mrate_tnr + rremove("x.text"),
          ncol = 2, nrow = 1) 

The scatter plot to the right shows that there is no correlation between monthly rate and years at company. And the one on the left indicates that there is a relationship between age and year at company, specially for males. It also shows that he majority of data falls under 10 years at the company and bellow age 45. As ages go higher the number of points decrease for tenure.

However, the plot shows heteroscedasticity as the range of residuals increases. This results from predictors or target variable which is not normally distributed. To satisfy the regression assumptions and trust the result the residuals should have constant variance.

Finally, use the following code to create box plot to observe years at company for employees in each job role.

set_palette(p, "Set1")
p <- df %>%
  ggplot(aes(JobRole, YearsAtCompany, fill = JobRole)) + theme_bw() +  
  geom_boxplot() +
  ggtitle("Year at Company for Departments") + 
  theme(plot.title = element_text(hjust=0.5, face = "bold", colour = "#386CB0", size = 16)) +
  xlab("")
p 

In the plot above, boxes are defined by the 25th and 75th percentiles, and the whiskers are showing the range. The distance between these two percentiles shows the interquartile range,  the median is shown with a horizontal line, and outliers are those separate dots.

The box plot shows that the mean tenure for managers and research directors are higher than other job roles. And the maximum tenure in other roles is 10 years.  The plot also shows that there are outliers for all roles except  for manager and research director.  

 

Share this page

Share on twitter
Share on linkedin

Leave a Reply

Your email address will not be published.