Hey there! It’s been a while since my previous update. In that update, I mentioned that I had relocated to Seattle and accepted a position as a Data Analyst/BI Engineer. Fortunately, this position allowed me to quickly adopt Python and become more fluent with Pandas in the context of data analytics.
I still prefer R, but it’s hard to ignore all of the cool and exciting libraries and frameworks available for Python, which has made me determined to increase my fluency in it.
Recently, I took a course on DataCamp covering the scikit-learn package for machine learning classification and regression. I’m not new to machine learning; in fact, you can find a few articles on that topic in this blog.
In R my tool of choice was the caret library. However, after completing the course I was surprised to see that a large portion of the pre-processing, model selection, and tuning is just as easy in scikit-learn as it is in caret.
I thought I’d put my new scikit-learn skills to use by outlining how to train a machine learning classification model on some interesting data.
To me, the most interesting data is data that is impactful and can drive some type of change.
Recently, I came across some data on Kaggle for Rev.com’s transcription service.
This dataset contains several metrics measuring job-level attributes of customers and agents.
In this article, we’ll identify pertinent variables and measurements that will allow us to predict whether an agent, should they accept a job, will meet the deadline set by the customer. This could allow Rev.com to steer particular jobs toward agents it thinks will be able to meet the job’s deadline.
We’ll do this by:
1. Introducing the dataset and checking for nulls and cleanliness.
2. Performing some EDA.
3. Carrying out a bit of feature engineering.
4. Training and testing two models using GLM and Random Forest.
Metadata about the dataset, as well as a header of the table itself, is printed below.
This dataset is essentially a log of roughly 20 thousand jobs fulfilled by Rev.com agents (referred to as revvers).
Each job record (row) contains the JobId, data about when the job was started and completed, its expected duration, and when it was due.
There is also some agent-level data, like the agent’s Rev.com level when the job was claimed and the agent’s ID.
The only customer data is the ID of the customer that submitted the job. Finally, we have two metrics of interest: the elapsed time in
seconds after the deadline (negative for jobs fulfilled within the deadline) and whether the job was fulfilled within the specified deadline.
rev_df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 20022 entries, 0 to 20021
## Data columns (total 25 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 JobId 20022 non-null int64
## 1 JobStartedOnTS 20022 non-null datetime64[ns]
## 2 JobStartedOnDay 20022 non-null int64
## 3 JobStartedOnWeekday 20022 non-null int64
## 4 JobStartedOnHour 20022 non-null int64
## 5 JobStartedOnMinute 20022 non-null int64
## 6 JobCompletedOnTS 20022 non-null datetime64[ns]
## 7 JobCompletedOnDay 20022 non-null int64
## 8 JobCompletedOnWeekday 20022 non-null int64
## 9 JobCompletedOnHour 20022 non-null int64
## 10 JobCompletedOnMinute 20022 non-null int64
## 11 JobDueOnTS 20022 non-null datetime64[ns]
## 12 JobDueOnWeekday 20022 non-null int64
## 13 JobDueOnDay 20022 non-null int64
## 14 JobDueOnHour 20022 non-null int64
## 15 JobDueOnMinute 20022 non-null int64
## 16 AgentLevelWhenJobClaimed 20022 non-null int64
## 17 ProjectLengthSeconds 20022 non-null int64
## 18 ProjectLengthSegment 20022 non-null int64
## 19 CustomerId 20022 non-null int64
## 20 AgentId 20022 non-null int64
## 21 ExpectedJobDurationSeconds 20022 non-null int64
## 22 ActualJobDurationSeconds 20022 non-null int64
## 23 ElapsedSecondsAfterDueOn 20022 non-null int64
## 24 DeadlineRespected 20022 non-null bool
## dtypes: bool(1), datetime64[ns](3), int64(21)
## memory usage: 3.7 MB
Fortunately, our dataset does not contain any nulls and appears to be pretty clean.
We can get a better idea of the dataset by printing out the dataset header.
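In pandas that’s a one-line call (a sketch; the table below shows the first six jobs):
### First six rows of the job log.
rev_df.head(6)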
## JobId JobStartedOnTS JobStartedOnDay JobStartedOnWeekday
## 1 1801338 2016-01-07 13:48:00 7 5
## 2 1793811 2016-01-06 05:58:00 6 4
## 3 1813562 2016-01-11 11:50:00 11 2
## 4 1807190 2016-01-08 17:48:00 9 7
## 5 1865366 2016-01-23 14:09:00 23 7
## 6 1828095 2016-01-14 09:41:00 14 5
## JobStartedOnHour JobStartedOnMinute JobCompletedOnTS JobCompletedOnDay
## 1 21 48 2016-01-09 12:34:00 9
## 2 13 58 2016-01-06 06:43:00 6
## 3 19 50 2016-01-12 06:04:00 12
## 4 1 48 2016-01-14 08:10:00 14
## 5 22 9 2016-01-23 14:09:00 23
## 6 17 41 2016-01-14 12:53:00 14
## JobCompletedOnWeekday JobCompletedOnHour JobCompletedOnMinute
## 1 7 20 34
## 2 4 14 43
## 3 3 14 4
## 4 5 16 10
## 5 7 22 9
## 6 5 20 53
## JobDueOnTS JobDueOnWeekday JobDueOnDay JobDueOnHour JobDueOnMinute
## 1 2016-01-09 15:58:00 2 9 23 58
## 2 2016-01-06 07:28:00 4 6 15 28
## 3 2016-01-13 19:20:00 4 14 3 20
## 4 2016-01-14 04:58:00 4 14 12 58
## 5 2016-01-23 18:39:00 4 24 2 39
## 6 2016-01-14 20:51:00 4 15 4 51
## AgentLevelWhenJobClaimed ProjectLengthSeconds ProjectLengthSegment CustomerId
## 1 1 5340 5 140440
## 2 2 180 1 159507
## 3 2 6300 5 185608
## 4 1 14520 6 75334
## 5 2 720 3 191231
## 6 2 1920 4 206837
## AgentId ExpectedJobDurationSeconds ActualJobDurationSeconds
## 1 194641 180600 81960
## 2 170003 5400 2700
## 3 105819 199800 65640
## 4 198256 472200 51720
## 5 190339 16200 0
## 6 101595 40200 11520
## ElapsedSecondsAfterDueOn DeadlineRespected
## 1 -12240 TRUE
## 2 -2700 TRUE
## 3 -134160 TRUE
## 4 11520 FALSE
## 5 -16200 TRUE
## 6 -28680 TRUE
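For reference, the code snippets in this post assume imports and aliases along these lines (plotnine supplies the ggplot functions, and the sk_* names are simple module aliases; the exact import block is a reconstruction):
### Imports and aliases assumed by the snippets below.
import numpy as np
import pandas as pd
from plotnine import *  # ggplot, aes, geom_col, geom_boxplot, theme_minimal, theme_538, ...
import sklearn.preprocessing as sk_preprocessing
import sklearn.model_selection as sk_model_selection
import sklearn.linear_model as sk_linear_model
import sklearn.ensemble as sk_ensemble
import sklearn.metrics as sk_metrics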
Before we jump into any kind of ML training, we should look at the response variable proportions. This will give us a general sense of how balanced our response variable is.
deadline_counts_df = rev_df.groupby(["DeadlineRespected"], as_index = False).\
agg(count = pd.NamedAgg("DeadlineRespected", "count"))
deadline_counts_fig = ggplot(deadline_counts_df, aes(x = "DeadlineRespected", y = "count", fill = "DeadlineRespected")) +\
geom_col() +\
theme_minimal()
deadline_counts_fig.draw()
deadline_counts_df
## DeadlineRespected count
## 0 False 914
## 1 True 19108
Yikes, our dataset is extremely imbalanced. We’ll have to address this later by stratifying our CV splits, as well as our training and test sets.
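This is exactly what we’ll do further below with train_test_split’s stratify argument (and GridSearchCV with an integer cv already uses stratified folds for classifiers). As a quick sketch of what stratification buys us, assuming X and y hold our features and DeadlineRespected labels:
### Sketch: a stratified split keeps the ~95/5 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = sk_model_selection.train_test_split(
    X, y, stratify = y, train_size = 0.85, random_state = 991)
print(y_tr.mean(), y_te.mean())  # both should be close to the overall proportion of respected deadlines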
Next, we can take a look at the AgentLevel of revvers that failed to meet deadlines.
agent_level_counts_df = rev_df.\
groupby(["AgentLevelWhenJobClaimed", "DeadlineRespected"], as_index = False).\
agg(count = pd.NamedAgg("AgentLevelWhenJobClaimed", "count"))
#agent_level_counts_df = agent_level_counts_df.assign(AgentLevelWhenJobClaimed = lambda x: np.where(x["AgentLevelWhenJobClaimed"] == 0, "Zero", np.where(x["AgentLevelWhenJobClaimed"] == 1, "One", "Two")))
agent_level_deadline_fig = ggplot(agent_level_counts_df,
aes(x = "DeadlineRespected",
y = "count",
fill = "AgentLevelWhenJobClaimed")) +\
geom_col(position = "fill")
agent_level_deadline_fig.draw()
It seems like agents in the bottom two revver levels account for more of the missed deadlines, but not dramatically more. We might also be interested in the distributions of expected job durations for jobs with met versus missed deadlines. We’ll remove outliers from this visualization, as some extreme values skew the figure and effectively compress the boxplots.
deadline_exp_job_duration_fig = ggplot(rev_df, aes(y = "ExpectedJobDurationSeconds", x= "DeadlineRespected")) +\
geom_boxplot(outlier_shape = "") +\
ylim(0,20e4)
deadline_exp_job_duration_fig.draw()
This is interesting. It seems like the distribution for jobs with missed deadlines skews larger than for jobs where deadlines were met. This could be due to revvers underestimating the amount of time it would take to complete a job. Assumptions aside, this is an interesting metric that we should include in our ML model.
Unfortunately, outside of our response variable and a couple of explanatory variables, not much of our dataset can be used in an ML model as-is. This is where we have to get a little creative and engineer some explanatory variables from the data we do have. This is a classic real-world situation: a raw dataset is rarely (if ever) plug-and-play ready for training an ML model.
First, let’s take a step back and re-examine the variables we have, so we can determine what kind of features we can create from them. Since our dataset is essentially a log of jobs fulfilled by revvers, we can extract not only the number of jobs a revver has completed, but also the number of claimed jobs that exceeded their deadlines. Furthermore, we have the revver’s record of prior time differences between deadlines and job submissions, as well as their history of expected durations for jobs they have claimed in the past. We can use all of these metrics to calculate some measure of deadline delinquency and job-length ‘over-extension’ by comparing the currently accepted job against previous job durations.
Let’s get started. First, we’ll engineer the count of previous jobs fulfilled by the revver, and the count of previous jobs where they failed to meet the deadline:
### Count number of previous jobs completed by Rev Agent.
rev_df["NPrevJob"] = rev_df.\
sort_values(["AgentId", "JobStartedOnTS"]).\
groupby(["AgentId"], as_index = False).cumcount()
### Count number of previous jobs where Agent missed deadline.
rev_df = rev_df.\
assign(NOverDeadline =\
lambda x: np.where(x["DeadlineRespected"] == False, 1, 0))
rev_df["NOverDeadline"] = rev_df.\
sort_values(["AgentId", "JobStartedOnTS"]).\
groupby(["AgentId"], as_index = False)["NOverDeadline"].\
cumsum()
rev_df = rev_df.\
assign(NOverDeadline = lambda x:
np.where(x["DeadlineRespected"] == False,
x["NOverDeadline"] - 1,
x["NOverDeadline"]))
rev_df = rev_df.sort_values(["AgentId", "JobStartedOnTS"]).reset_index(drop = True)
Let’s print a trimmed header of the result; if we focus on a particular revver, we can check whether our logic is correct.
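Here’s a sketch of how that trimmed view can be pulled; the column subset and the AgentId are simply the ones shown in the output below:
### Spot-check one revver's running counts.
check_cols = ["JobStartedOnTS", "JobId", "AgentId", "CustomerId",
              "DeadlineRespected", "NPrevJob", "NOverDeadline"]
rev_df.loc[rev_df["AgentId"] == 27006, check_cols].head()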
## JobStartedOnTS JobId AgentId CustomerId DeadlineRespected NPrevJob
## 1 2016-01-03 14:13:00 1782287 27006 196251 TRUE 0
## 2 2016-01-06 13:46:00 1795818 27006 75334 FALSE 1
## 3 2016-01-17 19:18:00 1839780 27006 170329 FALSE 2
## 4 2016-01-19 16:00:00 1846606 27006 170329 TRUE 3
## 5 2016-01-22 18:26:00 1862726 27006 170329 FALSE 4
## NOverDeadline
## 1 0
## 2 0
## 3 1
## 4 2
## 5 2
Awesome! For the revver with AgentId 27006, we see that the variables NPrevJob and NOverDeadline display the correct results.
Let’s take a look at the two features we just created and how they relate to jobs that end in missed deadlines.
We can plot a basic bar chart using NOverDeadline and see how many jobs with missed deadlines were claimed by revvers
with a record of previously missing deadlines.
deadlines_missed_counts_df = rev_df[rev_df["DeadlineRespected"] == False].\
assign(NOverDeadline = lambda x: np.where(x["NOverDeadline"] < 1, "0",
np.where(x["NOverDeadline"] < 10, "1-10", "GT10"))).\
groupby("NOverDeadline", as_index = False).\
agg(CountJobOverDeadline = pd.NamedAgg("NOverDeadline", "count"))
deadlines_missed_count_fig = ggplot(deadlines_missed_counts_df,aes(x = "NOverDeadline", y = "CountJobOverDeadline")) +\
geom_col() +\
theme_538() +\
xlab("Previous Deadlines Missed") +\
ylab("Jobs With Missed Deadlines (Count)") +\
ggtitle("Count of Jobs with Missed Deadlines \n by Previous Deadlines Missed by Agent")
deadlines_missed_count_fig.draw()
So marginally more jobs that resulted in missed deadlines were submitted
by agents with previous missed deadlines.
Let’s see what the number of previous jobs submitted by revvers tells us about missed deadlines.
We’ll do this by separating jobs by DeadlineRespected and plotting the number of previous jobs fulfilled by the revver
via boxplots.
nprevjob_missed_deadlines_fig = ggplot(rev_df, aes(x = "DeadlineRespected", y = "NPrevJob")) +\
geom_boxplot(outlier_shape = "") +\
ylim(0,60) + \
ggtitle("Previous Jobs Completed by Deadline Respected")
nprevjob_missed_deadlines_fig.draw()
It seems like a large majority of jobs that resulted in missed deadlines were submitted by revvers with
a smaller number of previous jobs fulfilled. This could be a direct result of inexperience or poor time
management.
Let’s get to the rest of the features we’re going to create. We just found a way to measure not only
how many jobs a revver has previously accepted, but also the number of jobs where they exceeded the deadline.
Fortunately, we also have a record of how long the revver actually took to complete each previous job, as well as the
expected job duration when they accepted it.
Using these two variables, we can create metrics that compare the current expected job duration with aggregate measures of the revver’s previous job durations. To do this, we’ll use an informal version of a z-score: we’ll calculate an expanding average and standard deviation of the revver’s job durations, and then compute how many standard deviations the current expected job duration sits from that mean.
We’ll apply the same methodology to the revver’s previous expected job durations, and we’ll also count the number of jobs previously submitted by each customer that resulted in missed deadlines.
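As a tiny illustration of the idea, take the expected durations 22200, 13800, and 16200 seconds (these happen to be the first three jobs of one revver in the header printed further below); note that the expanding window here includes the current value, which matches how ZScore is computed in the code that follows:
### Toy example of the "informal z-score": each expected duration is scored
### against the expanding mean and standard deviation of the durations so far.
toy = pd.Series([22200, 13800, 16200])
toy_z = (toy - toy.expanding().mean()) / toy.expanding().std()
print(toy_z)  # NaN (std is undefined for a single value), then roughly -0.707 and -0.277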
### Running (expanding) standard deviation and average of the expected durations of the revver's claimed jobs.
rev_df["StdExp"] = rev_df.sort_values(["AgentId", "JobStartedOnTS"]).\
    groupby(["AgentId"], as_index = False)["ExpectedJobDurationSeconds"].\
    expanding().std().reset_index(drop = True)
rev_df["AvgExp"] = rev_df.sort_values(["AgentId", "JobStartedOnTS"]).\
    groupby(["AgentId"], as_index = False)["ExpectedJobDurationSeconds"].\
    expanding().mean().reset_index(drop = True)
### Calculate the z-score for the job of interest using running standard deviation and mean of previously accepted jobs.
rev_df = rev_df.assign(StdExp = lambda x: np.where(x["StdExp"].isna(), 0, x["StdExp"])).\
assign(ZScore = lambda x: (x["ExpectedJobDurationSeconds"] - x["AvgExp"])/x["StdExp"]).\
assign(ZScore = lambda x: np.where((x["ZScore"].isna()) | (x["StdExp"] == 0), 0, x["ZScore"]),
DeadlineRespected = lambda x: np.where(x["DeadlineRespected"] == True, 1, 0))
### Count the number of jobs submitted by the customer that were not fulfilled within their deadline.
rev_df = rev_df.assign(CustomerNOverDeadline = lambda x: np.where(x["DeadlineRespected"] == 0, 1, 0))
rev_df["CustomerNOverDeadline"] = rev_df.sort_values(["JobStartedOnTS", "CustomerId"]).groupby(["CustomerId"], as_index = False)["CustomerNOverDeadline"].cumsum()
### Expanding standard deviation and mean of the agent's previous actual job durations, plus their average
### elapsed seconds after the deadline (shifted so the current job is excluded from the window).
rev_df["ActStdExp"] = rev_df.sort_values(["AgentId", "JobStartedOnTS"]).\
groupby(["AgentId"], as_index = False)["ActualJobDurationSeconds"].\
expanding().\
std().\
groupby(level = 0).\
shift().\
reset_index(drop = True).\
fillna(0)
rev_df["ActAvgExp"] = rev_df.sort_values(["AgentId", "JobStartedOnTS"]).\
groupby(["AgentId"], as_index = False)["ActualJobDurationSeconds"].\
expanding().\
mean().\
groupby(level = 0).\
shift().\
reset_index(drop = True).\
fillna(0)
rev_df["AvgSubBeforeDeadline"] = rev_df.sort_values(["AgentId", "JobStartedOnTS"]).\
groupby(["AgentId"], as_index = False)["ElapsedSecondsAfterDueOn"].\
expanding().\
mean().\
groupby(level = 0).\
shift().\
reset_index(drop = True).\
fillna(0)
rev_df = rev_df.assign(ActZScore = lambda x:\
np.where(x["ActStdExp"] == 0,
0,
(x["ExpectedJobDurationSeconds"] - x["ActStdExp"])/(x["ActStdExp"])))
First, we sort by AgentId and JobStartedOnTS, then group by AgentId and take the expanding standard deviation of the expected durations of the jobs the revver has claimed so far. We use the same process to obtain the expanding mean. These two metrics, together with the current expected job duration, are all we need to obtain ZScore. We follow a similar process to obtain ActZScore, which scores the current expected job duration against the actual durations of the revver’s previous jobs (the extra shift in that chunk excludes the current job from those expanding windows).
Let’s take a look and see how our response variable varies with these two metrics. We’ll do this by generating two separate boxplots for each individual class of DeadlineRespected, filtering outliers out of the visualization to avoid skewing.
z_score_prev_exp_fig = ggplot(rev_df.assign(DeadlineRespected = lambda x:\
np.where(x["DeadlineRespected"].astype(str) == "1", "Yes", "No")),\
aes(x = "DeadlineRespected", y = "ZScore")) +\
geom_boxplot(outlier_shape = "") +\
xlab("Deadline Respected") +\
ylab("Expected Deadline Length Z Score") + \
ggtitle("Z-Score of Current Expected Deadline\n Against Previously Expected Deadlines ") +\
ylim(-2.5,3)
z_score_prev_exp_fig.draw()
z_score_prev_act_fig = ggplot(rev_df.assign(DeadlineRespected = lambda x:\
np.where(x["DeadlineRespected"].astype(str) == "1", "Yes", "No")),\
aes(x = "DeadlineRespected", y = "ActZScore")) +\
geom_boxplot(outlier_shape = "") +\
xlab("Deadline Respected") +\
ylab("Expected Deadline Length Z Score") + \
ggtitle("Z-Score of Current Expected Deadline\n Against Previous Actual Job Durations") +\
ylim(-5,12)
z_score_prev_act_fig.draw()
It’s apparent that ZScore is shifted higher for jobs that failed to meet deadlines. Recall that ZScore measures the current job’s expected duration against the expected durations of jobs previously claimed by the revver. The shift tells us that a substantial number of jobs that ended with exceeded deadlines were longer than the jobs the revver had claimed before.
Unfortunately, there’s not much separation with ActZScore, but we’ll include it anyway for good measure.
Now that we have created our new metrics, it’s time to train our models! In this article we’ll be using the GLM and Random Forest algorithms to create our classification models. First, we’ll trim away any variables that aren’t pertinent to our model; afterwards, we’ll double-check that our dataset is free of nulls.
mod_rev_df = rev_df.loc[:,["AgentLevelWhenJobClaimed",
"ProjectLengthSeconds",
"ProjectLengthSegment",
"ExpectedJobDurationSeconds",
"NPrevJob",
"NOverDeadline",
"StdExp",
"ZScore",
"CustomerNOverDeadline",
"DeadlineRespected",
"ActZScore",
"ActAvgExp",
"ActStdExp",
"AvgSubBeforeDeadline"]]
mod_rev_df.isna().sum()
## AgentLevelWhenJobClaimed 0
## ProjectLengthSeconds 0
## ProjectLengthSegment 0
## ExpectedJobDurationSeconds 0
## NPrevJob 0
## NOverDeadline 0
## StdExp 0
## ZScore 0
## CustomerNOverDeadline 0
## DeadlineRespected 0
## ActZScore 0
## ActAvgExp 0
## ActStdExp 0
## AvgSubBeforeDeadline 0
## dtype: int64
Fantastic. Let’s print out a header of our dataset.
## AgentLevelWhenJobClaimed ProjectLengthSeconds ProjectLengthSegment
## 1 2 600 3
## 2 1 1020 3
## 3 1 600 3
## 4 1 720 3
## 5 1 278 1
## 6 1 1440 3
## ExpectedJobDurationSeconds NPrevJob NOverDeadline StdExp ZScore
## 1 13800 0 0 0.000 0.0000000
## 2 22200 0 0 0.000 0.0000000
## 3 13800 1 0 5939.697 -0.7071068
## 4 16200 2 0 4326.662 -0.2773501
## 5 6600 3 0 6452.906 -1.2552483
## 6 30600 4 0 9043.893 1.4064740
## CustomerNOverDeadline DeadlineRespected ActZScore ActAvgExp ActStdExp
## 1 7 1 0.000000 0 0.000
## 2 47 1 0.000000 0 0.000
## 3 52 1 0.000000 13860 0.000
## 4 1 1 2.602242 10680 4497.199
## 5 2 1 1.069464 10540 3189.232
## 6 10 1 5.607099 8625 4631.382
## AvgSubBeforeDeadline
## 1 0
## 2 0
## 3 -8340
## 4 -7320
## 5 -6860
## 6 -6075
Next, we’ll separate our explanatory and response variables into NumPy arrays named X and y, respectively. Scikit-learn estimators expect numeric array-like inputs: the explanatory variables as an M x N array, where N is the number of explanatory variables and M is the number of rows (observations), and the response variable y as an array of length M. Getting the data into this shape is no big deal: we can drop the response variable from our dataset and use the values attribute to obtain X, and for y we limit our scope to the response variable alone and use the values attribute again.
X = mod_rev_df.drop(columns="DeadlineRespected").values
y = mod_rev_df.DeadlineRespected.values
X = sk_preprocessing.scale(X)
We’ll make sure to scale our explanatory variables to avoid issues that could arise from differing scales and outliers in our dataset. Next, we’ll split our data into training and testing sets and define some parameters to search across. We’ll be using scikit-learn’s GridSearchCV, which takes the estimator object, a parameter grid, and the number of cross-validation folds; we’ll opt for a 10-fold cross-validation scheme.
Finally, we’ll fit our model with the fit method, supplying our training explanatory and response variables.
X_train, X_test, y_train, y_test =\
sk_model_selection.train_test_split(X, y, stratify= y,train_size=0.85, random_state= 991)
c_space = np.logspace(-5, 8, 15)
param_grid = {"C": c_space, "penalty": ['l2']}
logreg = sk_linear_model.LogisticRegression(max_iter = 1000)
logreg_cv = sk_model_selection.GridSearchCV(logreg, param_grid, cv = 10,verbose=0)
logreg_cv.fit(X_train, y_train)
## GridSearchCV(cv=10, estimator=LogisticRegression(max_iter=1000),
## param_grid={'C': array([1.00000000e-05, 8.48342898e-05, 7.19685673e-04, 6.10540230e-03,
## 5.17947468e-02, 4.39397056e-01, 3.72759372e+00, 3.16227766e+01,
## 2.68269580e+02, 2.27584593e+03, 1.93069773e+04, 1.63789371e+05,
## 1.38949549e+06, 1.17876863e+07, 1.00000000e+08]),
## 'penalty': ['l2']})
The output is a printout of the parameter grid GridSearchCV searched across; it finds the best parameters among them and, once fit, automatically uses those parameters for prediction.
logreg_cv.best_params_
## {'C': 0.0007196856730011522, 'penalty': 'l2'}
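GridSearchCV also exposes the mean cross-validated score achieved with those parameters (accuracy by default for classifiers). Given how imbalanced our classes are, though, the ROC AUC below is the more informative yardstick.
### Mean 10-fold CV accuracy of the best hyper-parameter combination.
logreg_cv.best_score_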
All that’s left is to assess the resulting model! The standard metric for assessing a dichotomous classification model is the area under the ROC (Receiver Operating Characteristic) curve. First, we’ll visualize the ROC for our GLM model.
y_pred = logreg_cv.predict(X_test)
y_score = logreg_cv.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = sk_metrics.roc_curve(y_test, y_score, pos_label = 1)
roc_dict = {"fpr": fpr,
"tpr": tpr,
"thresholds": thresholds}
roc_df = pd.DataFrame(roc_dict)
glm_roc_df = ggplot(roc_df, aes(x = "fpr", y = "tpr")) +\
geom_line() +\
geom_abline(aes(slope = 1, intercept = 0), color = "red")
glm_roc_df.draw()
We can easily retrieve the AUC by using the roc_auc_score function, supplying the test set response variable and the predicted probabilities obtained from the model.
sk_metrics.roc_auc_score(y_test, y_score)
## 0.8519001779626711
Training our random forest model is relatively similar. The only difference here is the parameter grid we’ll be searching across. In this instance, we’ll vary n_estimators, which indicates the number of decision trees in the forest. We’ll use the gini criterion and select auto for max_features.
n_estimators = np.arange(500,800,100)
criterion = ["gini"]
max_features = ["auto"]
param_grid_rf = {"n_estimators":n_estimators,
"criterion":criterion,
"max_features":max_features}
rf_class = sk_ensemble.RandomForestClassifier()
rf_cv = sk_model_selection.GridSearchCV(rf_class, param_grid_rf, cv = 10)
rf_cv.fit(X_train,y_train)
## GridSearchCV(cv=10, estimator=RandomForestClassifier(),
## param_grid={'criterion': ['gini'], 'max_features': ['auto'],
## 'n_estimators': array([500, 600, 700])})
Let’s take a quick peek at our ROC plot and print our ROC AUC.
y_prob = rf_cv.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = sk_metrics.roc_curve(y_test, y_prob, pos_label = 1)
roc_dict = {"fpr": fpr,
"tpr": tpr,
"thresholds": thresholds}
roc_df = pd.DataFrame(roc_dict)
ggplot(roc_df, aes(x = "fpr", y = "tpr")) + geom_line() + geom_abline(aes(slope = 1, intercept = 0), color = "red")
## <ggplot: (339736284)>
sk_metrics.roc_auc_score(y_test, y_prob)
## 0.8639502111874617
Not bad! A marginal improvement over our GLM model.
In this article, we explored the underlying dataset that would inform our model. We checked to make sure it was free of nulls and examined some metrics to assess whether they offered any value toward predicting our response variable. Since our dataset wasn’t readily usable for modeling, we created some simple metrics based on revvers’ prior work. Subsequently, we used those metrics to build models with the Logistic Regression and Random Forest algorithms from scikit-learn, and found that the ROC AUC for the random forest model was marginally higher than that of the logistic regression model.
This model could easily be employed in a production environment to de-emphasize or restrict particular jobs for a revver when the model predicts a high probability of failing to meet the deadline. Since it relies heavily on a revver’s prior work, its use could be restricted to revvers with a minimum number of completed jobs.
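As a rough sketch of how that might look (the feature-building helper, cutoff, and queue action here are hypothetical, and the scaling fit on our training data would need to be persisted alongside the model):
### Hypothetical scoring of a newly claimed job with the fitted random forest.
new_job_features = build_feature_row(job)   # hypothetical helper returning a 1 x 13 scaled feature array
p_respect = rf_cv.predict_proba(new_job_features)[0, 1]
if p_respect < 0.5:                          # assumed cutoff; would be tuned to business needs
    deprioritize_job_for_agent(job, agent)   # hypothetical action in Rev's job queue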