Organizing Messy Data

In a recent project, the client asked Maderas Partners to run a series of calculations related to their employees’ work history. The calculations themselves were rather basic. The real challenge, however, was organizing the dataset so that the calculations could be run for all employees over the entire time period.

Note: As always, the data in this example have been fully anonymized.


The project involved three tasks:

  1. Convert very irregular shift data for 173 employees into an easily readable format
  2. Perform the following calculations for each shift: total length of shift, length of lunch break, amount of time worked prior to lunch, amount of time worked following lunch
  3. Determine the following characteristics of each shift: lunch breaks that were exactly 30 minutes in length, shifts long enough to qualify for lunch but no lunch was taken, lunch breaks that occurred very early in the shift, and lunch breaks that occurred very late in the shift.

Primary challenge: A very irregular dataset

This sample dataset runs from August to December 2017 and is arranged by employee. For example, the dataset lists all shifts worked by Employee #1 from August to December 2017, then all shifts for Employee #2, then Employee #3, and so on. However, the employee number is listed only once, at the very start of that employee’s data.

Employees of this company often worked split shifts; e.g., an employee with an 8-hour shift might work 3 hours in the morning and then return to work several hours later that day to complete the remaining 5 hours. In the example above, Employee #163 worked a split shift on 8/1/2017, but a regular shift with a 28-minute lunch break on 8/3/2017.

Additionally, shifts are sometimes split across two lines at the lunch break or shift split, while at other times the entire shift is listed on a single line.

Python Script

Given the tasks and challenges, we decided that Python 3 was best suited for this analysis.

The first task was to fill in the missing employee numbers so that each shift was matched to an employee. Also, as you can see in the examples above, there are a lot of empty lines and lines with dates but no shift data.

Here is how the raw dataset appears in a pandas dataframe:

Our first step was to fill in the missing ID numbers and delete all blank lines. The script below deletes all rows that have no data for the first clock-in time at the start of the shift (“In”):
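A minimal sketch of this step on a toy dataframe (the column names “ID Number”, “Date”, and “In” are assumptions based on the post):

```python
import pandas as pd
import numpy as np

# Tiny mock of the raw export: the employee number appears only on
# the first row of each employee's block, and some lines are blank.
raw = pd.DataFrame({
    "ID Number": [163, np.nan, np.nan, np.nan],
    "Date": ["8/1/2017", "8/1/2017", "8/2/2017", "8/3/2017"],
    "In": ["8:00", "13:00", np.nan, "9:00"],
})

# Forward-fill the employee number down to every subsequent shift row.
raw["ID Number"] = raw["ID Number"].ffill()

# Drop rows with no first clock-in time ("In"): these are the blank
# lines and date-only lines that carry no shift data.
clean = raw.dropna(subset=["In"]).reset_index(drop=True)
```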

Next, we transformed the basic time data into datetimes by combining the date column with each punch of the clock to form four new columns: Punch 1 through Punch 4.
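Sketched on a toy row, this step might look like the following (the raw time-column names “In”, “Out”, “In 2”, “Out 2” are assumptions):

```python
import pandas as pd

# One raw shift row: a date plus four clock punches stored as text.
df = pd.DataFrame({
    "Date": ["8/1/2017"],
    "In": ["8:00"], "Out": ["11:00"], "In 2": ["16:00"], "Out 2": ["21:00"],
})

# Build datetime columns Punch 1-4 by concatenating the date with each
# punch; errors="coerce" leaves any missing punches as NaT.
time_cols = ["In", "Out", "In 2", "Out 2"]
for i, col in enumerate(time_cols, start=1):
    df[f"Punch {i}"] = pd.to_datetime(df["Date"] + " " + df[col],
                                      errors="coerce")
```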

To make the data easily readable for the client and to simplify subsequent calculations, we wanted shifts that were spread across several lines for the same day to appear on a single line. We grouped the data by ID Number and Date and then created two new columns: Punch 3a and 4a.
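One way to sketch this grouping, assuming the column names above: within each (ID Number, Date) group, the next row's Punch 1 and 2 become the current row's Punch 3a and 4a.

```python
import pandas as pd

# Two rows for the same shift; the second row holds the post-lunch
# punches in Punch 1 and 2.
df = pd.DataFrame({
    "ID Number": [163, 163],
    "Date": ["8/1/2017", "8/1/2017"],
    "Punch 1": pd.to_datetime(["2017-08-01 08:00", "2017-08-01 16:00"]),
    "Punch 2": pd.to_datetime(["2017-08-01 11:00", "2017-08-01 21:00"]),
})

# Within each (ID Number, Date) group, copy the next row's Punch 1
# and Punch 2 up into Punch 3a and 4a on the first row.
grp = df.groupby(["ID Number", "Date"])
df["Punch 3a"] = grp["Punch 1"].shift(-1)
df["Punch 4a"] = grp["Punch 2"].shift(-1)
```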

You’ll notice on line 102 that we pulled the information from Punch 1 and 2 on line 104 and applied it to Punch 3a and 4a, transforming each workday from two lines to one. This brought to light a new problem that will present itself once we begin calculating the length of each shift. Notice on line 102 that the final clock-out (Punch 4a) occurred at 2:24am on 8/2/17, but the date is listed as 8/1/17, the date the shift began. Leaving the date as-is would make it impossible to accurately calculate the full shift length. Therefore, we wrote code to check whether the time in Punches 2-4 was earlier than that of Punch 1 and, if so, add 1 day to the date. We then created four new columns to hold the adjusted dates: Punch 1Adj through 4Adj.
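A sketch of that midnight check, using one mock overnight shift (column names follow the post's convention; the specific times are illustrative):

```python
import pandas as pd

# Overnight shift: the final clock-out falls after midnight, but the
# raw data stamps every punch with the date the shift began.
df = pd.DataFrame({
    "Punch 1": pd.to_datetime(["2017-08-01 18:00"]),
    "Punch 2": pd.to_datetime(["2017-08-01 22:00"]),
    "Punch 3": pd.to_datetime(["2017-08-01 22:30"]),
    "Punch 4": pd.to_datetime(["2017-08-01 02:24"]),  # really 8/2
})

# If a later punch's clock time is earlier than Punch 1's, the shift
# crossed midnight, so roll that punch forward one day.
df["Punch 1Adj"] = df["Punch 1"]
for i in (2, 3, 4):
    crossed = df[f"Punch {i}"].dt.time < df["Punch 1"].dt.time
    df[f"Punch {i}Adj"] = df[f"Punch {i}"] + pd.to_timedelta(
        crossed.astype(int), unit="D"
    )
```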

Compared to the previous example, the final clock-out on line 102 is now 8/2/2017, with Punches 1-3 occurring on 8/1/2017. At this point, there are still two lines for this shift: one with the data as it originally was (the last line in the example above) and one with the data from Punch 1 and 2 moved to Punch 3 and 4. To clean this up, we used the following code:
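A minimal version of that cleanup might simply keep the first row of each (ID Number, Date) group, which now carries the combined shift, and drop the leftover second line:

```python
import pandas as pd

# Two rows remain for the 8/1 shift: the merged first row (with
# Punch 3a filled in) and the now-redundant second row.
df = pd.DataFrame({
    "ID Number": [163, 163],
    "Date": ["8/1/2017", "8/1/2017"],
    "Punch 3a": ["16:00", None],
})

# Keep only the first row per (ID Number, Date) group.
df = df.drop_duplicates(subset=["ID Number", "Date"], keep="first")
```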

At this point, the hard part (Task 1) is over: the employee numbers have been filled in, blank lines have been deleted, each shift is presented on a single line, and dates for shifts that extend past midnight have been adjusted as necessary. We are now working with a clean, well-organized dataset with no empty spaces and no duplicates. What comes next is a series of fairly simple calculations to satisfy Tasks 2 and 3, as listed at the top of the post.
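On the cleaned data, the Task 2 durations and Task 3 flags reduce to simple datetime arithmetic. A sketch on one shift row (the "early"/"late" two-hour thresholds here are illustrative assumptions, not the client's actual definitions):

```python
import pandas as pd

# One cleaned shift row with adjusted punch datetimes.
df = pd.DataFrame({
    "Punch 1Adj": pd.to_datetime(["2017-08-03 08:00"]),
    "Punch 2Adj": pd.to_datetime(["2017-08-03 12:00"]),
    "Punch 3Adj": pd.to_datetime(["2017-08-03 12:28"]),
    "Punch 4Adj": pd.to_datetime(["2017-08-03 17:00"]),
})

# Task 2: durations.
df["Shift Length"] = df["Punch 4Adj"] - df["Punch 1Adj"]
df["Lunch Length"] = df["Punch 3Adj"] - df["Punch 2Adj"]
df["Pre-Lunch"] = df["Punch 2Adj"] - df["Punch 1Adj"]
df["Post-Lunch"] = df["Punch 4Adj"] - df["Punch 3Adj"]

# Task 3: shift characteristics (thresholds are assumptions).
df["Exact 30-min Lunch"] = df["Lunch Length"] == pd.Timedelta(minutes=30)
df["Early Lunch"] = df["Pre-Lunch"] < pd.Timedelta(hours=2)
df["Late Lunch"] = df["Post-Lunch"] < pd.Timedelta(hours=2)
```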

There you have it. With the above Python script, we were able to take the raw Excel files furnished by the client and transform them into a single, easily readable dataframe. Please see our GitHub page for the raw data sample and the finished product.

End Notes

While we used the structure of the dataset in its original form, all data, including employee ID numbers, have been anonymized.

The full Python code and sample dataset can be found on GitHub. The above represents the bulk of the analysis that we ran, but some additional cleanup and adjustments were made that are not reflected here or on GitHub.

K-Nearest Neighbors, Part II

We decided to build a model to predict whether a stop would result in a frisk, rather than an arrest. Based on the results of the linear regression, there appears to be more variability regarding frisks. An arrest is a fairly objective decision that can be predicted based on the law, whereas an officer has more discretion over whom to frisk, making a machine learning model that predicts that behavior potentially more interesting.

Relying on the same dataset used in the linear regression, we built a K-Nearest Neighbors predictive model in Python. The first step was to import the pertinent libraries and load the data.

We then pared down the dataframe to only the columns of interest for our model (df_features), excluding the ‘Frisked’ variable, which is the outcome that we want to predict.

As is common in machine learning exercises, we divided the dataset into two sections: one used to build the model and the other to test its accuracy. We built our model using 75% of the dataset (the Training Set); the remaining 25% (the Test Set) was reserved for testing. In building the model, Python considers only the data in the Training Set. We then employ the model to predict the values of the Test Set. Because we know from the data which subjects were actually frisked, we can compare the predicted values to the actual values to determine how accurate our model is.
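A sketch of the feature selection and split, using a mock stand-in for the Terry-stop data (the feature names other than 'Frisked' are invented for illustration):

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Mock data; the real dataset has many more columns and rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Subject Age": rng.integers(18, 70, 200),
    "Officer Tenure": rng.integers(1, 30, 200),
    "Frisked": rng.integers(0, 2, 200),
})

# Features (everything except the outcome) and the outcome itself.
df_features = df.drop(columns=["Frisked"])
X = df_features
y = df["Frisked"]

# 75/25 train/test split; the model will see only the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```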

In the above, X are the variables used to predict the outcome and y is the variable that is predicted (Frisked).

(Very) Brief KNN Overview

KNN classifies points based on proximity. Imagine a 2-dimensional graph that plots the data along two axes, X and Y. Further imagine that your dataset contains 2 distinct classes, with one class plotted towards the lower left-hand corner of the graph and the second towards the upper right-hand corner. Now, let’s assume that you have an additional point plotted somewhere between the two clusters and you want to determine which class this new point belongs to. KNN looks at the k already-labeled points nearest the new one (its “nearest neighbors”) and assigns the new point to the class held by the majority of those neighbors. In other words, if the new point is plotted a bit closer to Group 1 than Group 2, its nearest neighbors will mostly belong to Group 1, and KNN will assign the new point there. That is easy to understand visually if you only have 2 or 3 groupings, but if you have, for example, 15 groupings, it becomes a bit more difficult. In this exercise, we started with k=2, consulting just the 2 nearest neighbors, to see how the model performed and then made adjustments as necessary.
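In scikit-learn terms, fitting and predicting with k=2 looks roughly like this (using synthetic two-class data in place of the Terry-stop features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 400 samples, 5 features, 2 classes.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# n_neighbors is k: how many nearest neighbors vote on each prediction.
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
```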

The two primary components of our accuracy measurement are precision and recall. Precision measures what percentage of the model’s positive predictions were correct. For example, if it predicted 150 frisks and 75 of those predictions were correct, the precision would be 50% (75/150). Recall measures what percentage of the actual frisks the model correctly identified. In other words, if the model correctly predicted the same 75 frisks, but there were actually 300 frisks in the entire dataset, the recall would be 25% (75/300).
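The worked example above reduces to two divisions:

```python
# The post's worked example: 150 predicted frisks, 75 of them correct,
# and 300 actual frisks in the whole dataset.
predicted_frisks = 150
correct_predictions = 75
actual_frisks = 300

precision = correct_predictions / predicted_frisks  # 75/150 = 0.50
recall = correct_predictions / actual_frisks        # 75/300 = 0.25
```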

In the above results, the precision and recall for non-frisks were quite good, at 0.80 and 0.94, respectively. However, the model did not do so well predicting the stops that resulted in a frisk, with a precision of 0.47 and a recall of 0.18. One way to try to improve the model is to adjust k, the number of neighbors considered. A better value of k can be chosen by employing the “Elbow Method”, which iterates through several values of k to find the one with the lowest error rate.
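The elbow search can be sketched as a loop over candidate k values, recording the test-set error rate for each (again on synthetic stand-in data; plotting the curve is left out):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Refit KNN for k = 1..20 and record the share of wrong predictions.
error_rates = []
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(np.mean(knn.predict(X_test) != y_test))

# Plotting error_rates against k reveals the "elbow"; a good k sits
# just past the steep drop-off in error.
```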

Here you can see that there is a large drop-off from around k=5 to k=6, with k=6 having a much lower error rate than our original k=2. Therefore, k=6 could be a good choice, but looking further, the error continues to drop. We tried a few values of k along that range and settled on k=12.

You can see from the above results that, compared to our original k=2, the precision and recall measurements for non-frisks increased slightly. Turning to frisks, the precision increased dramatically from 0.47 to 0.71, but recall unfortunately fell from an already low 0.18 to an even lower 0.16.

In sum, our model was precise: there were few false positives among both the frisk and non-frisk predictions. However, there were a very large number of false negatives for frisks. In other words, if the model predicts that a stop will result in the subject being frisked, we can be fairly certain it is correct, but it fails to flag the large majority of stops that actually end in a frisk.

End Notes

The purpose of this post was to demonstrate the KNN technique using actual data. If you run a similar analysis in the real world and are confronted with divergent precision and recall readings, as we were here, you will need to decide whether it is more important to minimize false positives (favoring high precision) or to minimize false negatives (favoring high recall).

If you’re interested, we ran the model on the entire dataset and included a column that displayed the actual value and the predicted value for each Terry stop so that you can compare the results. This can be found on our GitHub page. You can also find on GitHub the original dataset, the Stata tables from the linear regressions, and the full Python code.

Thanks to the City of Seattle for freely sharing their data. This dataset as well as many others can be found on the city’s Open Data Portal.