Hold onto your seats; I am going to guide you around our brief history of the price updating process.
Prices must be updated every day, following our Retailers procedures to provide the best experience to our customers.
If you think it should be a simple procedure, consisting of just updating the right record and go on with the next one, I totally agree with you!
When we were in the start-up phase back in the days it was that simple, but soon we had to forget about doing it this easily when we had to scale it up from a few thousand updates to millions.
In the next paragraphs, I will show you the evolutions of this process during the years.
In the beginning, I am talking about 2016, our structure and flow were pretty simple:
- Item table, to store item information such as name, description, thumbnails and weight
- Store table, to save shops details, such as brand and address
- ultimately, ItemStore table, to link the previous tables with a N:N relationship, meaning that item is present in the store with a determined price
Two separate systems were involved in this portion of environment:
- our retailer’s feed parser, written in Python3, internally known as Han Solo. The responsibility of this tool is to download the feed files, parse them through tailored logic, reconcile the item properties and propagate the prices
- one Laravel microservice written in Php7, friendly named as Lando, that was responsible for item and price management, saving the right data points in our Database
This flow was pretty straightforward.
For every retailer in our list, we run the parser against the data feed, returning as an outcome a tuple of store, item and price.
For those items, we individually sent prices via REST HTTP requests to the database.
The controller was responsible for fetching the right record in the ItemStore table:
SELECT * FROM item_store WHERE store_id = ? AND item_id = ? LIMIT 1;
If the record was found, we fired the necessary UPDATE statement with the updated price,
otherwise, an INSERT was necessary to create the new record.
Everything was wrapped in a Database transaction.
The performance was good; with three nodes doing the parsing and three nodes for the backend, we were able to successfully handle our first million of prices 🎉
The number of queries necessary to complete the entire flow in the best scenario was N (equals to the total number of the feed lines) and N * 2 at worst.
The number of HTTP requests, instead, was N in any case you could imagine. Then new retailers joined our marketplace and for us, it became clear that we needed to improve the scalability of this solution.
Our first improvement in this journey was the refactoring of the internal flow between the two projects.
We found three ways of tuning:
- we removed Database transactions, since that level of isolation was not necessary and for sure overkill in this situation. The original data from the parsers were unique for the tuple Item and Store
- the refactoring of the framework with a tiny one reducing the general overhead per every REST request. Lumen was our winner, as it reduced the amount of time necessary to migrate all the models
- last, but not least, with the framework refactor we took the decision to introduce a new endpoint to send multiple items in a single REST request. The goal of this decision was to reduce the bottleneck of the HTTP protocol and lower the number of API calls between the projects. Until now, every row processed in the feed resulted in one API call (to price updating microservice) and one or two DB statements (run by this microservice)
We did it!
With the same number of nodes, we were able to duplicate the number of items processed per night.
With no critical issues during the migration of the projects, it was a success for everyone.
We drop the loading time of the framework and the number of necessary HTTP requests for the same hours.
The only concern was the amount of memory used by both Python and Php.
A necessary gambit.
But we weren’t satisfied enough, we knew we could do better!
We tried different ideas and failed at different times.
Only one idea proved to be promising: the delta (or diff) between the entire feed from the day before against the new feed.
Every night we run the entire partner’s feed without any consideration about the previous iterations.
You could agree with me, it was a waste of resources.
The funny fact was that the solution was right under our nose, used constantly in our day to day work.
If you are familiar with Git deltas then you know what I mean. If not then here is a link to the details if you are brave enough 🤓
We used this concept against the feeds we parsed every night.
The important thing is to have the exact same order of records inside your cluster (when doing diffs) to fully benefit from the algorithm.
We changed our parser using the built-in difflib function using the outcome as:
- if the row was removed, we hide the record
- else if the row was added, we activate the record
- otherwise, we update the record
The result was outstanding.
Roughly 65% of items haven’t changed the price versus the previous day.
We realized that we were processing the same immutable item more and more for nothing in particular.
So we decided to process only the changes, even if we are reading the full retailer feed.
We were able to more than duplicate the number of items in the feed to be processed.
Five million of rows, of course with the same amount of nodes.
This was the beginning of 2019.
How about one query to rule them all?
And now we were in the middle of 2019.
We were preparing for a big project: the expansion into new markets.
The goal of this cross-team project was to bring our model and our products in another country which soon proved to be Poland.
As you can imagine, a new country means a whole lot of new items as products are mostly country-specific (naming, labels, nutrition values, ingredients etc.)
We duplicated a lot of items, like Nutella, with a new package, currency, EAN and nutritional values.
We had to scale.
We lacked creativity about new solutions to improve our software without horizontal scalability.
Thanks to our DevOps, we received a very cool trick.
If you remember, the total of queries was something between N and N * 2, where N is the number of changed items, thanks to the delta.
How about one query?
The solution was to create a brand new empty table with the same structure of the ItemStore where we could insert massively records,
to then move the records and the original ItemStore table.
No, ItemStoreUpdate table.
Every night we clean the table before running the parser.
Instead of handling the bulk request in a loop (with N insert queries), we decided to put the entire payload inside the new table in one query.
With this simple new improvement, we aligned the cardinality of HTTP requests with the number of INSERT queries, massively reducing the number of INSERT queries to be performed versus only one:
INSERT INTO item_store_update VALUES (?, ?, ?, ...);
And now the real magic:
UPDATE item_store INNER JOIN item_store_update ON item_store_update.store_id = item_store.store_id AND item_store_update.item_id = item_store.item_id SET item_store.price = item_store_update.price;
Of course you have to sync the new items, which boils down to:
SELECT * FROM item_store_update LEFT JOIN item_store ON item_store.store_id = item_store_update.store_id AND item_store.item_id = item_store_update.item_id WHERE item_store.item_id IS NULL;
Impressive. You could insert those new items with a single INSERT query into ItemStore table.
The result of this change was the huge number of seven million records per night.
The cardinality of this new flow is N, where N is the number of REST requests plus one for the update and other two for new items.
We had to fight against Gap Lock, but at the end, we were proud of the new flow.
What about the future?
We were in the middle of a transactional phase.
If you did not read our previous post, you should spend a few minutes into it.
The current open points that we are working on are as follows:
- the whole process should work during the day, without locking table, changing prices and stock information during the day
- new countries, new items, new currencies
- event-driven pattern, sharing retailer information among our services
- reduce the time to market for new partners and the developer time for generic set-up
We started moving records as messages by our broadcaster, reducing, where possible, the fatty synchronous HTTP protocol to thin AMPQ asynchronous protocol.
We chunked all our INSERT queries since our PDO library can not handle the huge amount of items per bulk request.
Since we are dealing with prices, every bug or typo could affect our economics; that’s why we are adding tons of fallback systems to avoid and minimize those kinds of issues.
At the time of writing, we are handling more than twelve million records per night.
And yes, with only six nodes, we only upgrade the number of vCPUs and the amount of memory per node.
The future is bright in front of us.
If you have the right creativity, you are interested in similar challenges, and you want to solve hard technical problems like these: visit our careers site to see our current openings in StackOverflow.