One of the questions I get more frequently when I say that I am a Data Engineer is “but what do you do?”. Depending on who you ask, you can receive a variety of answers, especially related to the tools and skills we may use. If you ask me that question and you know about data processes, I will answer “we’re the backend of the Data Scientists”. If you don’t have a clue about data-related jobs, I will answer “we organize big amounts of data”. And by “organize”, what I mean is “clean, model, calculate, store and make available”.
In Everli, I’m in charge of data related to our shoppers, from the moment they apply to one of our job offers until they deliver an order to a customer. This process involves more factors than you may think because we need to have the right amount of shoppers recruited and the right amount of shoppers working exactly when they are needed. The main goal is to balance a good service to our customers and the wellbeing of our shoppers.
This balance has been a challenge since the beginning, and this example is only one more step in finding the best fit for our needs. It has also been a data modeling challenge, and a great opportunity to dig deeper into shopper data. And now I can explain one of the things a Data Engineer does!
What is saturation and why is it so important?
Saturation is a way of measuring the amount of work done by a courier, but the actual math under that has evolved along with our data.
The key is to balance the amount of work we propose to our couriers, so they are not overworked and they have time to give a perfect service to our clients, but their workload is enough to fulfill their available working hours. If we’re not able to measure their workload accurately, any effort to balance this will be misled.
How did we do that in the past (and what problems did we face)?
A few years ago, when we didn’t even have a proper Data Engineering team, we calculated the saturation by dividing the number of orders delivered by a courier in a day by the number of hours available of that particular courier on that day. Why did we do this? because we didn’t have a precise calculation of Total Order Time (i.e.: the time necessary to execute an order, which includes the time necessary for going to the store, pick the items, pay and deliver the order), so an estimation of 60 minutes per order seemed accurate. We needed something to give us a clue, and an estimation is better than no information at all. And, last but not least, we didn’t think about batching.
Batching is not bad in itself. Actually, batching is very good because it means that one courier processes more than one order at the same time. That means less average Total Order Time, so the courier can earn more money and we increase efficiency. The problem, as always, was because it was an unknown-unknown risk: we were not aware of them nor we understood. So, if we saw a courier who had 8 orders per 8 working hours, we would estimate a 100% saturation, we would think they were oversaturated and we wouldn’t propose them more orders. But, if they were able to batch the orders in pairs, that would mean that the real saturation was, more or less, half of that. As we didn’t know, we didn’t send them more orders, so shoppers were losing the possibility of doing additional work, and we couldn’t offer the best of our service.
What are we doing now?
Now we have access to better Total Order Time calculations and have a bigger Data Engineering team that can spend time in processing all this data and transforming it into information. This allows us to have a minute-detail saturation, but it wasn’t easy. The purpose of this article is to explain the data modeling challenge we had to face, and the solution we arrived at.
So, in short, we need to transform data from an order level to a shopper-date-hour level. Let’s introduce our example. We need to go
If we draw a Gantt-like graph for the first table, we can see the problems more clearly:
The challenges we face modeling this are:
- We need to transform the granularity, and that’s always tricky.
- We have to take into account the gaps inside hours.
- We also have to take into account the overlaps and don’t double count this time.
- The tables we’re working with are big, so it is mandatory to avoid high consuming routines.
What we tried (and didn’t work)
We tried dividing the worked hour into time slots, setting the start and end of each one, and then picking the minimum start time and maximum end time for each hour to end with only one row per hour. The time difference between start and end would be the worked time.
Let’s see it on tables:
- We divide the hours into time slots and marking the max and min:
2. Then we group by hours and picking minimum start time and maximum end time:
As you can see, the saturation at 11 am is 60, but it should be 40. This model isn’t able to recognize the empty gaps because it expects to have the end timestamp after the start timestamp.
In short, any time difference model will have problems identifying gaps or overlaps. Why? Because we don’t only need to know the number of minutes, but also the position among other minutes. So, what type of data includes information about amounts and positions? Exactly! Vectors!
So, vectors then.
If we create one binary column per minute in an hour and mark it as ‘true’ if the courier worked that minute in one particular order, and then group by the hour, and then counting the ‘true’ we have at the end, we will have the number of minutes worked! The con is that we will have to create sixty SIXTY (60) columns and operate them. The pro is we didn’t have any other idea so it was worth a try. Let’s do this step by step.
For legibility reasons, I’m using 10 minutes block columns so we can have 6 instead of 60.
Step 1: Create and fill the “board” with the worked minutes
Step 2: Group by the hour, but do it binary (OR operator)
Step 3: Count flags and sum up the minutes.
Do you see the ‘40’ in the minute’s column in the 11 am row? yes? it’s because IT WORKED.
One of our concerns was the performance of this solution. We are talking about managing a couple of seven-figure tables, with a lot of columns of their own, so adding 60 columns more is something we should manage carefully. In the end, with a couple of modeling tricks, a wise selection of the operations to perform and because Domo allows us to run this ETL on Amazon Redshift, the execution of this process didn’t suffer any big increase in time or resources demand.
Also, this model is flexible enough to take into account different shopper roles and all possible ways of batching orders. As a company, we try to improve our processes, and our logistics models are always evolving to find the best fit for our shoppers and our clients. We, as Data Engineers, have the responsibility of designing long term solutions, and we’re confident that this is robust enough to face the challenges of those new ways to distribute the work.
Finally, this was the first time we had to face modeling like this. I’m writing the article and I am the one in charge of this data, but we came to this solution together, as a team. I could not wish to have better coworkers to spend a Friday afternoon brainstorming with.
And, FYI, I presented this to all my Tech coworkers, with more puns and an adequate amount of memes. If you want to be there next time, we are hiring! (yes, remote!) (yes, Data Engineers too!)