Organizing Messy Data

In a recent project, the client asked Maderas Partners to run a series of calculations related to their employees’ work history. The calculations themselves were rather basic. However, the real challenge was organizing the dataset in such a way that allowed for the calculations to be run for all employees over the entire time period.

Note: As always, the data in this example have been fully anonymized.


  1. Convert very irregular shift data for 173 employees into an easily readable format
  2. Perform the following calculations for each shift: total length of shift, length of lunch break, amount of time worked prior to lunch, amount of time worked following lunch
  3. Determine the following characteristics of each shift: lunch breaks that were exactly 30 minutes in length, shifts long enough to qualify for lunch but no lunch was taken, lunch breaks that occurred very early in the shift, and lunch breaks that occurred very late in the shift.

Primary challenge: A very irregular dataset

This sample dataset runs from August-December 2017 and is arranged by employee. For example, the dataset lists all shifts worked by Employee #1 from August-December 2017 and then lists all shifts for Employee #2, then Employee #3, and so on. However, the employee number is only listed once at the very start of that employee’s data.

Employees of this company often worked split shifts, e.g. an employee with an 8-hour shift working 3 hours in the morning and then returning to work several hours later that day to complete the remaining 5 hours. In the above example, Employee #163 worked a split shift on 8/1/2017, but a regular shift with a 28-minute lunch break on 8/3/2017.

Additionally, shifts are sometimes split between two lines at the lunch break or shift split, but at other times, they are all listed on a single line

Python Script

Given the tasks and challenges, we decided that Python 3 was best suited for this analysis.

The first task was to fill in the missing employee numbers so that each shift was matched to an employee. Also, as you can see in the examples above, there are a lot of empty lines and lines with dates but no shift data.

Here is how the raw dataset appears in a pandas dataframe:

Our first step was to fill in the missing ID numbers and delete all blank spaces. The script below deletes all rows that have no data for the first clock in time at the start of the shift (“In”):

Next, we transformed the basic time data into datetime by combining the date column and each punch of the clock to form four new columns: Punch 1-4

In order to make this easily readable for the client and easier to perform subsequent calculations, we wanted the shifts that were separated on several lines for the same day to all be on the same line. We grouped the data by ID Number and Date and then created two new columns: Punch 3a and 4a.

You’ll notice on line 102, that we pulled the information from Punch 1 and 2 from line 104 and applied it to Punch 3a and 4a. This transformed each workday from split between two lines to just one. This brought to light a new problem that will present itself once we begin calculating the length of each shift. Notice on line 102 that the final clock out (Punch 4a) occurred at 2:24am on 8/2/17 but the date is listed as 8/1/17, the date that the shift began. Leaving the date as such will make it impossible to accurately calculate the full shift length in Python. Therefore, we wrote the following code to check whether the time in Punches 2-4 was less than that of Punch 1 and if so, add 1 day onto the date. We then created 4 new columns to include any date adjustments: Punch 1Adj-4Adj

Compared to the previous example, the final clock out on line 102 is now 8/2/2017, with Punches 1-3 occurring on 8/1/2017. At this point, there are still two lines for this shift: one with the data as it originally was (the last line in the example above) and one with the data from Punch 1 and 2 moved to Punch 3 and 4. To clean this up, we used the following code:

At this point, the hard part (Task 1) is over: the employee numbers have been filled in, blank lines have been deleted, each shift is presented on a single line, and dates for shifts that extend past midnight have been adjusted as necessary. We are now working with a clean, well-organized dataset with no empty spaces and no duplicates. What comes next is series of fairly simple calculations to satisfy Tasks 2 and 3, as listed at the top of the post.

There you have it. With the above Python script, we were able to take the raw Excel files furnished by the client and transform them into a single, easily read dataframe. Please see our Github page for the raw data sample and the finished product.

End Notes

While we used the structure of the dataset in its original form, all data, including employee ID numbers, have been anonymized.

The full Python code and sample dataset can be found on Github. The above represents the bulk of the analysis that we ran, but there were some additional clean up and adjustments that were made that are not reflected here or on Github.


The Difference-in-Difference (DiD) analysis is a powerful econometric tool with very practical uses.

The classical use of DID is to analyze the difference between two groups after some sort of treatment. Say, for example, that you want to compare scores of a college entrance exam, the SAT, for two groups of students, one group that participated in an SAT prep course (the treatment group) and one group that did not (the control group). Let’s further assume that all students took the SAT prior to the start of the course. And as a last assumption, enrollment in the prep course was not random. Students were required to express interest in the program.

In the above example, you could simply compare the final SAT scores of the treatment and control group to gauge the success of the enrichment program. For example, if the treatment group scored 1400 on average, but the control group scored 1200, the 200-point difference could be viewed as the effect of the program.

The problem with that approach is that it does not take other important factors into consideration such as time (maybe students just get better at taking the test over time) and starting position (maybe the treatment group scored higher on the first round of testing than the control group, providing a “head start” ). DiD separates the difference in SAT scores, 200 points in our example, into its constituent parts – time, starting position and the effect of treatment – quantifying the effect of each component. Of the additional 200 points the treatment group scored over the control, perhaps 50 of the points were due to the fact that the treatment group are just better test-takers, as evidenced by the original SAT scores, and would have scored on average 50 points higher even in the absence of treatment.  DiD allows to you isolate the pure treatment effect.

To set up the analysis, we created two dummy/categorical variables that take on a value of either 0 or 1:

  • Time: The time variable denoting whether the observation is pre-treatment (first SAT score) or post-treatment (second SAT score)
  • Participant: Whether a student completed the prep course (the treatment)

We then created a variable called time_participant, which is the interaction between these two variables (i.e. the time variable multiplied by the participant variable).

We generally perform econometric analysis in Stata, but it is our understanding that this can also be done in R. Note that t0 refers to the first test and t1 refers to the second.

Variables under consideration:

Score: The SAT score. The dependent variable in this example

Time: The change in SAT score attributable to time, with the assumption being that students performed better the second time that they took the test. The coefficient on this variable represents the increase in score from t0 to t1 for students that did not complete the prep course.

Participant: The change in SAT score attributable to differences between the two groups prior to treatment. This coefficient quantifies the difference between participants and non-participants at t0, i.e. prior to the start of the prep course.

Time_Participant: The difference-in-difference variable. This coefficient displays the difference between program participants and non-participants at t1 while controlling for differences at t0.

The resulting regression equation is:

SAT Score = 1107.59 + 71.126time + 43.168participant + 118.816time_participant

In simpler terms, the above model predicts that all students are expected to score at least 1108 on the first test. Taking the test a second time is expected to boost scores by 71 points. As stated above, participation is not random. There is a selection bias built into the model in that those students that decided to participate in the program already had on average higher scores on the original test than the the students that chose not to participate. The difference in scores between those that chose to participate and those that did not PRIOR to participating in the program (i.e. at t0) is 43 points. And lastly, taking the test a second time and participating in the enrichment program together increases scores by 119 points.

Taken all together, a student that did not participate in the prep course should expect an SAT score of 1179 (constant plus coefficient on the time variable) on the second exam. While a student that participated in the course can expect to achieve 1341 on their second attempt, a difference of 162 points.

In conclusion, the DiD model is most useful in zeroing in on the effect of the treatment while excluding the effect of other variables: time in this example, but potentially also gender, age, or other factors that could affect outcomes. Students participating in the course scored 162 points higher than students that did not, but the effect of the treatment on its own was only 119 points.

An example of how this could be useful is a simple cost benefit analysis. Given the time and expense associated with developing and administering this program, let’s assume that the school set a threshold of 150 points. At 150 points or higher, the program is worth the cost, but below that threshold, it is not. Looking at the pure results, an expected increase of 162 points would exceed the threshold, triggering an expansion of the program. However, through difference-in-difference, administrators see that the program itself is only responsible for 119 points of that gain, thus falling short of expectations.

Note: The sample dataset and the Stata Do-file can be found on Github.

K-Nearest Neighbors, Part I

According to Wikipedia:

A Terry stop in the United States allows the police to briefly detain a person based on reasonable suspicion of involvement in criminal activity. Reasonable suspicion is a lower standard than probable cause which is needed for arrest. When police stop and search a pedestrian, this is commonly known as a stop and frisk.

The City of Seattle publicly released data on each Terry Stop made by Seattle Police Officers since 2015. The data include information about the stop (date, location, reason), about demographic information about the officer and person stopped. The data also include whether the subject was frisked during the stop, whether a weapon was found, and whether the stop resulted in an arrest.

In this exercise, Maderas Partners sought to build a model to predict whether a person would be arrested or frisked during a stop based on the other information provided by the city. To accomplish this goal, we employed a commonly used machine learning technique called K-Nearest Neighbors (KNN).

Before building the model, we wanted to get a better sense of the data to determine whether KNN would be a good fit. First, we looked at the number of stops, arrests, frisks, weapons found broken out by race.

According to 2010 Census Estimates, Whites account for 66.3% of the population and Blacks account for 7.7%. Therefore, it appears that Blacks were disproportionately subjected to Terry stops whereas Asians that make up 13.7% of the population were underrepresented in the number of stops. However, using the number of Terry stops as a baseline, each of the three racial groups were arrested at approximately the same rate (between 25% and 29%), but Whites were frisked less often (25.5% of Asians, 26.3% of Blacks, and 18.7% of Whites).

Next, we ran a pair of OLS linear regressions to determine whether there was any statistical relationship among the primary variables in question. We first examined whether being arrested during a stop was correlated with any of the following variables:

  • Being frisked during the stop
  • Whether the subject was carrying a weapon
  • The officer’s age at the time of the stop
  • The officer’s gender
  • The officer’s race
  • The subject’s gender
  • The subject’s race

Unsurprisingly, we found that undergoing a frisk and possessing a weapon were both correlated with being arrested. Also, male officer tended to make more arrests than female or non-binary officers, while older officers were less likely to make arrests than younger ones. However, the race variables did not provide much insight because there was not much variation among the different races.

Replacing arrested with being frisked as the dependent variable proved to be more interesting. (Also, the R-squared, while not ideal, is much higher.) The age, gender, and weapon variables remained consistent with our findings from above, but the racial variables also demonstrated significance. Black officers stand out as the only group to demonstrate a statistically significant positive correlation with frisking, suggesting that black officers tend to frisk more often than officers of other races. Turning to the subjects of the stops, males are positively correlated with being frisked (but not with being arrested). Most interesting, being black was positively correlated with getting frisked, while being white was negatively correlated. Both were statistically significant at the 5% level. This suggests that in a Terry stop, blacks were more likely to be frisked than all other racial groups, while whites were less likely to be frisked relative to all other racial groups.

Given these preliminary findings, we decided to move forward with the analysis. Part II will outline the process and outcome of our predictive model.

K-Nearest Neighbors, Part II

We decided to build a model to predict whether a stop would result in a frisk, rather than an arrest. Based on the results of the linear regression, there appears to be more variability regarding frisks, likely leading to a more interesting model. An arrest is a fairly objective decision that can be predicted based on the law. However, an officer has more discretion regarding whom to frisk, which would make a machine learning model that could predict human behavior potentially more interesting.

Relying on the same dataset used in the linear regression, we built a K-Nearest Neighbors predictive model in Python. The first step was to import the pertinent libraries and upload the data.

We then pared down the dataframe to only the columns of interest for our model (df_features), excluding the ‘Frisked’ variable, which is the outcome that we want to predict.

As is common in machine learning exercises, we divided the dataset into two sections: one used to build the model and the other to test the accuracy of the model that we just built. We built our model using on 75% of the dataset (the Training Set). The remaining 25% (the Test Set) was used to test the accuracy of the model. In building the model, Python only considers the data in the Training Set. We then employ that model to predict the values of the Test Set. Because we know from the data which subjects were frisked in reality, we can compare the values that the model predicted to the actual values to determine how accurate our model is.

In the above, X are the variables used to predict the outcome and y is the variable that is predicted (Frisked).

(Very) Brief KNN Overview

KNN groups the data into sections. Imagine a 2-dimenionsal graph that plots the data along two axes, X and Y. Further imagine that your dataset can be divided has 2 distinct groups, with one group plotted towards the lower left-hand corner of the graph and the second plotted towards the upper right-hand corner of the graph. These are your two “neighbors”. Now, let’s assume that you have an additional point that is plotted somewhere between the two neighbors and you want to determine which neighborhood this new point should be grouped with. KNN looks at the already established groups, or neighbors, and assigns the new point to the closest neighbor. In other words, if the new point is plotted a bit closer to Group 1 than Group 2, KNN will assign the new point to Group 1, its nearest neighbor. That is easy to understand visually if you only have 2 or 3 groupings, but if you have, for example, 15 groupings, it becomes a bit more difficult. In this exercise, we start out with just 2 groupings  to see how the model performs and then make adjustments as necessary.

The two primary components of our accuracy measurement are precision and recall. Precision measures what percentage of the model’s predictions were accurate. For example, if it predicted that there were 150 frisks and 75 of those predictions were correct, the precision would be 50% (75/150). The recall measures what percentage of the total the model predicted correctly. In other words, if the model predicted the same 75 frisks, but there were actually a total of 300 frisks in the entire dataset, the recall would be 25% (75/300)

In the above results, the precision and recall for non-frisks was really good, at .80 and .94, respectively. However, it did not do so well with predicting the stops that resulted in a frisk with a precision of .47 and recall of 0.18. One way to try to improve the model is to adjust the number of groups. Choosing a more accurate number of groups can be done by employing the “Elbow Method”, which iterates through several different group sizes to find the best one, the one with the lowest error rate.

Here you can see that there is a large drop off from around k=5 to k=6, with k=6 having a much lower error rate than or original k=2. Therefore, k=6 could be a good choice, but looking further, the error continues to drop.  We tried a few values for k along that range and settled on k=12.

You can see from the above results that compared to our original k=2, our precision and recall measurements for no-frisks increased slightly. Turning to frisks, the precision increased dramatically from 0.47 to 0.71, but recall unfortunately fell from an already low .18 to an even lower .16.

In summation, our model was accurate in its predictions in that there were a low number of false positives in both the frisk and non-frisk predictions. However, there were a very large number of false negatives for the frisked predictions. In other words, if the model predicts that a stop will result in the subject being frisked, we can be fairly certain that it is correct, but if it predicts that a subject will not be frisked, there’s a really good chance that it’s wrong.

End Notes

The purpose of this post was to demonstrate the KNN technique using actual data. If running a similar analysis in the real world and you are confronted with divergent precision and recall readings as we have here, you will need to decide whether it is most important to minimize false positives (high precision value) or minimize false negatives (high recall value).

If you’re interested, we ran the model on the entire dataset and included a column that displayed the actual value and the predicted value for each Terry stop so that you can compare the results. This can be found on our GitHub page. You can also find on GitHub the original dataset, the Stata tables from the linear regressions, and the full Python code.

Thanks to the City of Seattle for freely sharing their data. This dataset as well as many others can be found on the city’s Open Data Portal.