Han Solo: a Feed Processing Story

Reading Time: 5 minutes

Some questions that anyone using Everli might wonder about: how do we get all the products onto the web and mobile apps? And how do we keep our catalog and prices updated daily?

This process has taken a bit of work and, for now, a couple of iterations. In this post, we’ll focus on the first iteration. Let’s try to shed some light on it!

The mission

Our mission is to provide the best catalog for our users with the most accurate prices, item details and availability.

For this to happen, we update our catalog nightly with feed data, provided by our integrated retailers and other data sources. This way, when the users start shopping, they will always have the information up to date.

Our goal is to be as convenient as possible for our partners, that is, to be an enabler for them. We provide flexible channels through which they can share their feeds with us.

In order to process the feeds in an accurate, reliable, and scalable manner, we need a dedicated service to perform this task.

Han Solo

As mentioned in another post, our retailer feed parser is known as Han Solo.

We’ve been naming internal processes after Star Wars characters for a while now, so don’t fret: Han Solo is not flying solo.

He’s in good company with close friends such as Leia and Lando! 😁

We’ve been talking about feeds for a bit now, so what exactly is a feed? It’s nothing more and nothing less than a CSV file containing the information for all the items that a store needs to have available.

To process a feed, Han Solo has to:

  • Get the feed
  • Normalize it (the shape and information in the feed can differ depending on the retailer):
    • Extract the information from the received feed
    • Transform it: a new feed is generated from the original one’s information, but respecting a standardised shape, so that further steps can be agnostic of what the starting feed looked like
  • Check the differences between the last feed for this store and the current one, a.k.a. the delta (see the sketch after this list)
  • Update the info using just those differences.
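To make the delta step a bit more concrete, here is a minimal sketch (not Han Solo’s actual code). It assumes the normalized feed is a CSV keyed by an item identifier, which we call `sku` purely for illustration:

```python
import csv

def load_feed(path, key_field="sku"):
    """Load a normalized feed CSV into a dict keyed by the item identifier."""
    with open(path, newline="") as fh:
        return {row[key_field]: row for row in csv.DictReader(fh)}

def compute_delta(previous, current):
    """Return the items added, changed, and removed since the last feed."""
    added = [current[k] for k in current.keys() - previous.keys()]
    removed = [previous[k] for k in previous.keys() - current.keys()]
    changed = [
        current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    ]
    return added, changed, removed
```

Only the items in the delta move on to the Update step, which keeps the nightly work proportional to what actually changed.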

A decoupled nature

Each stage is considered a job. A job uses the output of the previous job (if any) as its own input.

Even though each job is quite decoupled from the others, all of them are actually executed in the same process.
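As an illustration of this chaining, here is a minimal sketch; the class and function names are ours, not the actual Han Solo code:

```python
class Job:
    """One stage of the pipeline: it takes the previous job's output as its input."""

    def run(self, feed):
        raise NotImplementedError

class FetchJob(Job):
    def run(self, feed):
        ...  # download the raw feed from the retailer's channel
        return feed

class NormalizeJob(Job):
    def run(self, feed):
        ...  # rewrite the feed into the standardised shape
        return feed

def process_feed(feed, jobs):
    # All the jobs run sequentially in the same process,
    # each one consuming the output of the previous one.
    for job in jobs:
        feed = job.run(feed)
    return feed
```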

[Diagram: the Han Solo processing flow]

In the first steps, up to the Update job, each job has one input and one output: the feed itself, and the feed again after the transformation it underwent during the job. Within the Update, however, there are several things to do for each of the items in the feed, and it would be very slow and wasteful to process one item at a time, wait for it, and then move on to the next one. So, from the Update onward, the steps are done in a different way.

Enter concurrency

“I had a problem, so I introduced multiprocessing”

“I ha-lems 2 pro-an ow”

Each of the steps in Update takes place in a pool of processes. The idea of stages persists at this point: each pool of processes feeds the next one, and each process consumes items from a multiprocessing queue. We went with multiprocessing over multithreading because the tasks we carry out are CPU demanding. That means that from the Update onward each feed is not treated as a whole: each item becomes the processing atom.
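The wiring looks roughly like the sketch below. It is a simplified illustration with names of our own choosing; sentinel propagation between stages, queue sizing, and error handling are left out:

```python
import multiprocessing as mp

def stage_worker(work, in_queue, out_queue):
    """Consume items from in_queue, apply `work`, and push the results downstream."""
    for item in iter(in_queue.get, None):  # None acts as a shutdown sentinel
        out_queue.put(work(item))

def start_stage(work, in_queue, out_queue, workers=4):
    """Start a pool of processes that all consume from the same queue."""
    procs = [
        mp.Process(target=stage_worker, args=(work, in_queue, out_queue))
        for _ in range(workers)
    ]
    for p in procs:
        p.start()
    return procs
```

Chaining two calls to `start_stage` with a shared queue in between gives one stage feeding the next, with every item flowing through independently.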

Of course, our friend Han Solo couldn’t be perfect. Let’s explore some of its issues.

Clean code, where art thou?

Sometimes we need things that work, and we need them fast. In these instances, we consciously trade some good practices for being able to actually use a finished (albeit improvable) product.

Using this approach we can iterate and move quickly and study what works, what doesn’t, and how to improve it in the next iterations.

It proved to be the right (and necessary) approach at the time, as we managed to scale our stores and quickly replace the previous project, which had hit its upper limit. However, those trade-offs hit us back:

  • Lack of coverage by unit tests
  • Rigid code that was not very well decoupled
  • The multiprocessing logic had a lot of areas to improve on
  • Although the stages were very decoupled as an idea, the code had some dependencies that made it complicated to test them as isolated components

The time window

Beyond flexibility and agility, most of the issues had something to do, directly or indirectly, with complying with the time window.

The time window is simply the period, every night, during which we can do our feed processing, so that when we are done the prices can be updated. The time window is crucial for the process:

  • When it’s over, we bulk update all the prices that were affected by the feed processing
  • If some processing is not done by the time it closes, then the remaining feeds are lost for the day
  • We must respect that time so we’re ready for the morning traffic

As we grow, it’s harder to comply with the time window during which we have to process all the feeds that we get. This means we need a more performant, more scalable solution than the current one.

Everyone likes graphs!

As a showcase of the growth that we need to sustain, we can see some metrics for the ingested feeds from 2020 to 2022.

It’s also interesting to note that from 2021 onward we’ve been slowly but steadily replacing Han Solo with its next iteration until no more stores run through Han Solo.

[Chart: feed_rows_raw_monthly]
Raw rows found in each feed per month (some will get dropped)

External service needs

There were constraints not only within Han Solo, but also in related services. With the way we process the feeds, it’s not really possible to run a feed at any time of the day: some information (like prices) can only be updated in bulk in the early morning.

Wait for that result

Every time we run a feed, we wait for it to be done and get its result to be able to start the next one. This is also how we know if the run was successful or not, and thus are able to later retry if there was some error.

As a direct consequence, each Han Solo process that starts a feed run is blocked until it’s done. Since the stages are not separate processes, we can only run, at the same time, as many feeds as there are Han Solo instances running!
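In pseudo-Python, that boils down to something like this (the names are illustrative, not the real code):

```python
def run_all_feeds(feeds, run_feed):
    """One Han Solo instance: process its feeds one at a time, blocking on each."""
    results = []
    for feed in feeds:
        # The instance blocks here until the whole run is finished, so the
        # number of feeds processed in parallel equals the number of instances.
        results.append(run_feed(feed))
    return results
```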

Retry that error

If we get an error that we interpret as something that can be fixed by running the feed one more time (e.g. the feed is not available yet, or a connection error), we’ll retry the run, from the beginning to the end. This retry logic can be further improved: since, as we’ve discussed, the jobs are pretty decoupled, in most cases retrying just the job that failed would be enough to solve the problem. Retrying the whole run, in turn, wastes a lot of the time we spend on it.
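To make the cost concrete, here is a sketch of that behaviour; `run_feed` stands in for the whole pipeline and the exception set is purely illustrative:

```python
RETRIABLE = (ConnectionError, TimeoutError)  # e.g. feed not available yet

def run_feed_with_retries(run_feed, feed, max_attempts=3):
    """Retry the whole run on transient errors, restarting from the first job."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_feed(feed)
        except RETRIABLE as err:
            last_error = err  # a smarter version would retry only the failed job
    raise last_error
```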

How does the future look?

Well, at this point we’ve painted a rather bleak picture. Does that mean Han Solo is doomed? Of course not!

As one of the first Python projects at Everli, Han Solo taught us very valuable lessons: what to do, and what not to do.

We’ll see how we kept these in mind for the next iteration in another post!


We keep iterating, and we keep improving. If you want to tackle the evolving challenges that we face, check out the current openings.

Author: @arudp
