Data Exploration in Databricks

Here, I am going to build a simple linear regression model in Databricks to predict employee tenure. For this project, I use a shortened version of the IBM HR dataset that I previously downloaded from Kaggle. After cleaning the data, I saved it to my computer as Employee data.

An advantage of working with Databricks is that you can access your data from multiple sources. In a previous exercise, I connected Databricks to a blob storage in the Azure portal. Here, I will upload the dataset from my computer.

To start the project, log in to the Databricks environment ➛ from the left sidebar click on Data ➛ Upload File ➛ browse for the dataset ➛ Create Table with UI.

After clicking Create Table with UI, you will be redirected to a new page where you can make changes to the table.

Set the table name ➛ employeedata, leave Create in Database as default, check the box next to 'First row is header', and change the data types of all columns ➛ integer (INT) ➛ Create Table.

Hitting the Create Table button redirects you to a new page where you can review the table schema and sample data.

Check the schema and, if it looks fine, go to your workspace from the left sidebar and create a new notebook. To set up the notebook, use the following code to import the Python libraries, set some formatting for visualizations, read the employee data, and import it as a pandas DataFrame.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from scipy import stats

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import warnings
from sklearn.exceptions import DataConversionWarning

#high quality display
%config InlineBackend.figure_format = 'retina'

#Set colors and fonts
plt.rcParams['axes.axisbelow'] = True
mpl.rcParams['axes.titlesize'] = 18
mpl.rcParams['axes.labelsize'] = 14
mpl.rcParams['xtick.labelsize'] = 8
mpl.rcParams['ytick.labelsize'] = 8
mpl.rcParams['font.size'] = 14  
mpl.rcParams['legend.fontsize'] = 13
#only see 2 decimal
pd.set_option('display.precision', 2)

# import the data and use it as pandas dataframe
df1 = spark.sql("Select * from employeedata")
df = df1.toPandas()

Check the data dimensions and the data types using the following commands.

#to see the dataframe dimensions
print("df dimensions: {}".format(df.shape))
#to look at the data types
print(df.dtypes)

The results below show that my dataset has 10 columns and 1470 rows. The output for command 4 shows that the data types of all columns are integer. However, apart from age, years at company, years in current role, total working years, and monthly income, the rest of the columns are categorical.

Use the following code to change the data types of the columns department, job involvement, gender, and job satisfaction to category. 

df['JobInvolvement'] = df['JobInvolvement'].astype('category', copy=False)
df['Gender'] = df['Gender'].astype('category', copy=False)
df['JobSatisfaction'] = df['JobSatisfaction'].astype('category', copy=False)
df['Department'] = df['Department'].astype('category', copy=False)

The output below shows that the data types have changed to category.

Set the employee number as the index.
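The code for this step is not shown here; assuming the column is named `EmployeeNumber` (as in the original IBM HR dataset), a minimal sketch with a toy frame:

```python
import pandas as pd

# Toy stand-in for the employee data; 'EmployeeNumber' is an assumption
# based on the column name in the original IBM HR dataset.
demo = pd.DataFrame({'EmployeeNumber': [1, 2, 4],
                     'Age': [41, 49, 37],
                     'YearsAtCompany': [6, 10, 0]})
demo = demo.set_index('EmployeeNumber')  # employee number becomes the row label
print(demo.index.name)  # EmployeeNumber
```

On the real frame, the same call is `df = df.set_index('EmployeeNumber')`.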

Let's see if there is a relationship between age, monthly income, and total working years. Here, I expect a positive linear relationship between the variables. My assumption is that as age, total work experience, and income increase, so will employee tenure (years at company).

Use the following syntax to build a pair plot with Seaborn library.

sns.pairplot(df, vars=['Age', 'MonthlyIncome',  
                       'TotalWorkingYears', 'YearsAtCompany'], hue='Gender')

The plot shows a positive relationship between age, total working years, and years at company, and this relationship is more visible in the green dots that represent female employees. However, the relationship between total working years and years at company can be biased because both can point to a single measure. That is, for some employees total working years may equal years at company, because their only work experience is with this company.
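This suspicion is easy to check by counting the rows where the two columns are equal; a sketch with toy data (on the real frame, you would run the same comparison on `df`):

```python
import pandas as pd

# Toy rows: the second and third employees' whole careers are with this company.
demo = pd.DataFrame({'TotalWorkingYears': [8, 10, 6],
                     'YearsAtCompany':    [6, 10, 6]})
same = demo['TotalWorkingYears'] == demo['YearsAtCompany']
print(int(same.sum()), 'of', len(demo))  # 2 of 3
```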

Now, let's use the following code to plot the possible correlations between monthly income, years at company, and total working years.

cor_ob = df[['YearsAtCompany', 'MonthlyIncome', 'TotalWorkingYears']]
corrmat = cor_ob.corr()
f, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(corrmat, vmax=.8, square=True, annot=True, fmt='.2f', cmap='viridis')

The correlation heat map shows a correlation of 0.77 between monthly income and total working years. However, this correlation is not relevant to our question; it makes sense because employees with more work experience can earn more in the company. The correlation between years at company and the other variables is not significant.

Now, let's explore the role of gender one more time.

f, ax = plt.subplots()
sns.stripplot(x='Gender', y='YearsAtCompany', data=df, jitter=True)

The plot shows that men (1) have higher years at company. It also shows some outliers in the data for females that might bias the analysis.

Next, check whether there is a pattern between job satisfaction and employee tenure.

f, ax = plt.subplots()
sns.stripplot(x='JobSatisfaction', y='YearsAtCompany', data=df, jitter=True)

The output does not show any specific pattern for job satisfaction. 

Next, let's see whether working in any of the departments makes a difference in tenure.

f, ax = plt.subplots()
sns.violinplot(x='Department', y='YearsAtCompany', data=df, palette='Set3')

The violin plot shows that employee tenure in each department has almost the same mean and interquartile range.

Now, let's look at job involvement and tenure.

f, ax = plt.subplots()
sns.swarmplot(x='JobInvolvement', y='YearsAtCompany', data=df)

The swarm plot shows that employees with a job involvement rating of 3 have the highest tenure within the company, and those rated 2 have the second highest. Interestingly, employees with the highest job involvement rating are less spread around the mean and have shorter years at company.

The information provided by these visualizations indicates that we can predict years at company using these variables. However, the prediction might not be 100% accurate.

Let's build a regression model to predict years at the company based on all variables.

But first, we must convert the categorical values to indicator (dummy) variables.

cat_columns = ['JobInvolvement', 'Department', 'Gender', 'JobSatisfaction']
df = pd.get_dummies(df, prefix_sep="_", columns=cat_columns)

The output of command 16 shows that all four variables have been converted to indicators, and each class was added to the data frame as a separate column.
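As a quick illustration of what `pd.get_dummies` did here, a sketch on a toy column:

```python
import pandas as pd

demo = pd.DataFrame({'Gender': [0, 1, 0], 'YearsAtCompany': [6, 10, 0]})
demo['Gender'] = demo['Gender'].astype('category')
out = pd.get_dummies(demo, prefix_sep='_', columns=['Gender'])
# The original 'Gender' column is replaced by one indicator column per class.
print(list(out.columns))  # ['YearsAtCompany', 'Gender_0', 'Gender_1']
```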

Next, split the data into test and training sets, create a linear regression model, and train the model.

X = df.drop('YearsAtCompany', axis=1)  
y = df['YearsAtCompany'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=40)  
print(len(X), len(y_train), len(X_test)) 

# Create linear regression object
regr = LinearRegression()
# Train the model using the training sets
, y_train)
# Make predictions using the testing set
y_pred = regr.predict(X_test)

# Print coefficients and errors 
print('Coefficients: \n', regr.coef_)  
print('Mean squared error: %.2f' 
      % mean_squared_error(y_test, y_pred))  
print('Coefficient of determination: %.2f' 
      % r2_score(y_test, y_pred))  

The output shows the coefficients for each variable, mean squared error, and coefficient of determination. 

Mean Squared Error (MSE) is a measure of the differences between the values predicted by the model and the values actually observed: it is the average of the squared differences between predicted and observed values. My MSE is 12.94, which is pretty high. It means that the model is not precise and reliable.
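To make the definition concrete, here is MSE computed by hand on a tiny made-up example (the numbers are illustrative only):

```python
# Hand computation of MSE on toy values.
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
residuals = [t - p for t, p in zip(y_true, y_hat)]
mse = sum(r * r for r in residuals) / len(residuals)  # (1 + 0 + 4) / 3
print(round(mse, 2))  # 1.67
```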

The Coefficient of Determination (R²) summarizes the explanatory power of the regression model and is computed from the sums-of-squares terms. R² describes the proportion of the variance of the dependent variable explained by the regression model. If the regression model is perfect, R² is 1. The R² of my model is 0.66, which means it explains about 66% of the variance in employee tenure.
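The same toy numbers from the MSE example show how R² is computed from the sums of squares, R² = 1 − SS_res / SS_tot:

```python
# Hand computation of R^2 on toy values.
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]
mean_y = sum(y_true) / len(y_true)                          # 5.0
ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # 4 + 0 + 4 = 8
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))   # 1 + 0 + 4 = 5
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.375
```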

The output also shows the coefficients of each column, but only the values. In multivariable linear regression, the model has to find the optimal coefficients for all the attributes.

To understand which variable has the most predictive power, we must see the values with the names of the columns next to them. Let's create a table that shows both column names and their coefficients.
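One way to build such a table (a sketch; the column names and coefficient values below are toy stand-ins for `X.columns` and `regr.coef_` from the fitted model):

```python
import pandas as pd

# Stand-ins for X.columns and regr.coef_; the values are hypothetical.
columns = ['Age', 'MonthlyIncome', 'TotalWorkingYears']
coefs = [0.31, 0.0004, -0.05]
coef_table = pd.DataFrame({'Variable': columns, 'Coefficient': coefs})
print(coef_table)
```

On the real model, the same pattern is `pd.DataFrame({'Variable': X.columns, 'Coefficient': regr.coef_})`.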

The coefficient table shows that age and monthly income are the most powerful variables for this prediction. To improve the model, build a new regression model based on these two variables and evaluate it with 5-fold cross-validation.

#shuffle the data
df = df.sample(frac=1)
features_to_use = ['Age', 'MonthlyIncome']
x2D = np.array(df[features_to_use])
y2D = np.array(df['YearsAtCompany'])

model = LinearRegression()
cv_results = cross_validate(model, x2D, y2D, scoring=('r2', 'neg_mean_squared_error',
                                                      'neg_mean_absolute_error'), cv=5)
MAE = -cv_results['test_neg_mean_absolute_error']
RMSE = np.sqrt(-cv_results['test_neg_mean_squared_error'])
R2 = cv_results['test_r2']

print('MAE:', MAE)
print('R2:', R2)
print('RMSE:', RMSE)
print('Average MAE:\t {:.2f} (+/- {:.2f})'.format(MAE.mean(), MAE.std()))
print('Average RMSE:\t {:.2f} (+/- {:.2f})'.format(RMSE.mean(), RMSE.std()))
print('Average R^2:\t {:.2f} (+/- {:.2f})'.format(R2.mean(), R2.std()))

The result shows the MAE, RMSE, and R² for each of the 5 cross-validation folds, along with the average of each metric. The values indicate that the model did not improve much and explains about 68% of the variance.
