How we handled a spike of more than 4,000% new users in a week.

Reading Time: 7 minutes

Covid-19 changed, and is still changing, everything.
From customer behaviour to customer spend, no one could have expected this.
Neither could our servers.
This post is a simple overview of what we did to keep our website and apps reliable during this time.
And yes, you read that right: we kept the infrastructure responsive while serving more than 4,000% new users (on top of the existing ones), plus a few DoS attacks.

The base.

Everli (formerly Supermercato24) is an Italian scale-up founded in late 2014, and our technical team (39 people at the time of writing) is fully remote; if you have not read our previous post yet, it is worth a few minutes.

Back in February, our infrastructure was pretty basic (technology-wise) and cheap (no cloud buzzwords).
As you can see, this is a common stack for a web company; the real value and magic lie in our code:

  • A few bare metal servers virtualizing multiple nodes
  • Master-slave MySQL servers to store our data
  • Elasticsearch cluster for full-text searches
  • Redis for virtual carts, bearer tokens and cache
  • Nginx and HAProxy services to serve HTTP requests
  • PHP 7.0 as first backend language using Laravel/Lumen as frameworks

We were in the middle of a transitional phase:
moving from an ugly old monolith, written by a few of us during the start-up phase, to state-of-the-art specialized API services, and upgrading our languages to PHP 7.2, Kotlin, Swift and VueJS.
With this refactoring we started experimenting with RabbitMQ as a message broker, sharing messages and responsibilities across our codebase and decoupling logic.
The async pattern helped a lot to reduce the load on our core project, but it was not that simple, as I will explain in the next paragraphs.

Spike of new Users in March.

The lockdown.

The 8th of March was the beginning of our journey.
Slack channels dedicated to crash alerts were demanding our attention, because the MySQL configuration could not handle the spike of simultaneous connections to the Master node.
In short, we found 3 different issues:

  • the current node was not ready to serve all the simultaneous read connections
  • too many read queries retrieving the same static content (e.g. item details such as thumbnails, descriptions, prices, weights)
  • database deadlocks while saving data

First of all: do not panic!
It is very easy to scream like hell in these situations, but it is counterproductive.
You cannot solve everything on your own, but with the help of your team you surely can!
Communication should be effective and clear, with the right context.
Honestly, we failed at the beginning of the crisis, but after a few hours we re-arranged into small squads, each taking care of a separate issue.
We saw the light: a small team of highly motivated developers is better than a few rockstars.

While a few of us were working on platform stability, the rest of the team decided to change our product roadmap, focusing on features that prioritized shopper and customer safety and frictionless flows in our user experience.
Fail fast, fail often, recover quickly.
We tried different scenarios to improve our behaviour, like contactless delivery and subscription delays, until our data was good and stable enough for our users.

How we improved the Database reliability.

A single node for read and write operations was not efficient anymore.
With the help of our DevOps team, we decided to create a battery of slave databases (Replicas) to split the load of read queries.
We kept high availability between the Master and Slave databases for write operations, while for the Replicas we used a TCP connection balancer that chooses the best available node.
The drawback of this approach is the replication delay between Master and Replica (1s on average), which is why we used the sticky option to reuse the current write connection when reading data back within the same request.
Another decision we made was to use HAProxy as the balancer, with a health check and a single DNS entry, instead of configuring multiple database connections to the Replica battery in each project and balancing across them randomly; the same entry is shared between our projects.
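The sticky read-after-write rule can be sketched in a few lines. Laravel's `sticky` database option implements this for you; below is a minimal, language-agnostic Python sketch of the routing logic (the class and connection names are invented for illustration):

```python
import random

class StickyRouter:
    """Route reads to replicas, but after a write in the same
    request, read from the primary to avoid replication lag."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.wrote = False  # per-request flag, reset on each new request

    def connection_for_write(self):
        self.wrote = True
        return self.primary

    def connection_for_read(self):
        # Sticky: reuse the write connection within the same request,
        # so we never read stale data we just wrote.
        if self.wrote:
            return self.primary
        return random.choice(self.replicas)

router = StickyRouter("primary", ["replica-1", "replica-2"])
print(router.connection_for_read() in ("replica-1", "replica-2"))  # True
router.connection_for_write()
print(router.connection_for_read())  # primary
```

In production the "choose a replica" step is HAProxy's job behind the single DNS entry; the application only distinguishes read from write connections.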

To reduce common read queries, such as retrieving the same information for multiple users, we decided to use singletons: save once, read many.
We stored all the static information in Redis, refreshing the expiration on every cache HIT, similar to an LFU policy with dynamic aging.
We serialized the database payload in Redis as JSON, hydrating the Eloquent model every time it was needed, without introducing breaking changes in our workflow.
What about the drawbacks?
You know it: cache invalidation.
We wrote some internal commands to refresh data from the database, flushing the singletons.
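A minimal sketch of this save-once-read-many pattern, with a plain dict standing in for Redis (the `TTL` value and helper names are illustrative, not our production code):

```python
import json
import time

TTL = 300  # seconds; illustrative value

store = {}  # {key: (json_payload, expires_at)} -- stands in for Redis

def cache_get(key, load_from_db):
    now = time.time()
    entry = store.get(key)
    if entry and entry[1] > now:
        # Cache HIT: refresh the expiration so hot keys stay cached,
        # similar to LFU with dynamic aging.
        payload = entry[0]
        store[key] = (payload, now + TTL)
        return json.loads(payload)
    # Cache MISS: read from the database and serialize as JSON,
    # ready to hydrate a model on the next read.
    row = load_from_db(key)
    store[key] = (json.dumps(row), now + TTL)
    return row

def invalidate(key):
    # The "flush the singleton" command for cache invalidation.
    store.pop(key, None)
```

With Redis itself, the HIT branch would be a `GET` followed by an `EXPIRE` to push the TTL forward.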

The deadlock issue was a nasty problem.
It was very difficult to replicate in our staging environment, since it is tough to emulate the same traffic and load, so we tried different solutions in our Canary branch.
First of all, to solve phantom reads in critical tables like Customer and Order, we tuned our queries, using pessimistic locking and isolation levels only when and where needed.
Secondly, to reduce the amount of write operations on telemetry tables, we took two different paths:

Database Architecture: read & write queries
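The pessimistic-locking shape looks like this. The sketch below uses SQLite so it is self-contained and the schema is invented for illustration; in MySQL the SELECT would carry `FOR UPDATE` to lock the row for the rest of the transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'pending')")

# One transaction: read the row locked, decide, then write.
# MySQL equivalent of the SELECT below:
#   SELECT status FROM orders WHERE id = 1 FOR UPDATE
# (SQLite, used here only to make the snippet runnable, locks at the
# database level instead of the row level.)
with conn:
    (status,) = conn.execute(
        "SELECT status FROM orders WHERE id = 1").fetchone()
    if status == "pending":
        conn.execute(
            "UPDATE orders SET status = 'confirmed' WHERE id = 1")

print(conn.execute(
    "SELECT status FROM orders WHERE id = 1").fetchone()[0])  # confirmed
```

Because the lock is taken at read time, two concurrent workers cannot both see `pending` and race each other into a deadlock on the later UPDATE.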

How we decoupled the flow.

During this transitional phase, we started decoupling responsibilities.
Honestly speaking, it is hard to think asynchronously.
You cannot expect to do it in a snap: you are adding a new layer, increasing complexity and introducing new kinds of bugs (memory leaks, server heartbeats, etc.).
But in the end, we were all satisfied 😃
We moved from a single-thread to a multiprocessing pattern, where we have at least two consumers per node per project waiting for new events to handle in the queue.
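The consumer side of that pattern can be sketched as follows. Threads and an in-process queue stand in here for the separate consumer processes and RabbitMQ; all names are illustrative:

```python
import queue
import threading

def consumer(events, results):
    # Each consumer blocks on the queue and handles events until it
    # receives a poison pill (None).
    while True:
        event = events.get()
        if event is None:
            break
        results.append(f"handled:{event}")

def run(payloads, n_consumers=2):
    # At least two consumers per node, as in the pattern above.
    events, results = queue.Queue(), []
    workers = [threading.Thread(target=consumer, args=(events, results))
               for _ in range(n_consumers)]
    for w in workers:
        w.start()
    for p in payloads:
        events.put(p)
    for _ in workers:
        events.put(None)  # one poison pill per consumer
    for w in workers:
        w.join()
    return sorted(results)

print(run(["order.created", "email.welcome"]))
# ['handled:email.welcome', 'handled:order.created']
```

The poison-pill shutdown is one simple answer to the "server heartbeat" class of bugs mentioned above: consumers always exit through the same, observable path.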

As a first step, we removed some internal SOA HTTPS requests and webhooks that did not need an immediate response.
I am talking about emails, text messages, and Order and Customer propagation.
This was an important change, and thanks to it we were able to remove a lot of load from PHP-FPM.
PHP-FPM has a fixed number of available workers, let's call it X: this is how many concurrent users you are able to serve at the same time.
Consequently, if each request also has to call your internal system synchronously, for example to send a push notification, you halve your maximum number of active connections.
Put those calls in a separate pool, consumed one by one, and keep your web servers thin for customer requests.
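The producer side of the same idea, sketched in Python with a local queue standing in for RabbitMQ (handler and message names are invented for illustration):

```python
import queue
import threading

notifications = queue.Queue()  # stands in for a RabbitMQ queue

def handle_request(user_id):
    # Before: a blocking HTTPS call to the notification service held a
    # PHP-FPM worker for the whole round trip. Now we just publish.
    notifications.put({"type": "push", "user": user_id})
    return "200 OK"  # the web worker is free immediately

def notification_worker():
    # A separate consumer pool drains the queue one message at a time.
    while True:
        message = notifications.get()
        if message is None:
            break
        # ... call the push/email/SMS provider here ...
        notifications.task_done()

threading.Thread(target=notification_worker, daemon=True).start()
print(handle_request(42))  # 200 OK, without waiting for the provider
```

The response time of `handle_request` no longer depends on the notification provider's latency, which is exactly what keeps the X web workers available for customers.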

We used the same principle for data ingestion.
Every time you search for something, we save metadata for the training of our search engine.
This used to be a sequential task for every search: get the results, store the meta-information, show the outcome.
Our Elasticsearch cluster was not ready for the amount of reads and writes per second we had.
We started sampling metadata to buy the necessary time, and meanwhile we decided to follow the Bulk operation route.
The search produces an event with the necessary metadata in RabbitMQ; the consumer retrieves the message and stores the payload in a Redis list.
When the list is full, we save it with a single Bulk command.
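The buffer-and-flush step can be sketched like this, with plain Python lists standing in for the Redis list and for the Elasticsearch `_bulk` call (the batch size is illustrative):

```python
BULK_SIZE = 3  # illustrative; real batch sizes would be much larger

buffer = []       # stands in for the Redis list
bulk_writes = []  # stands in for Elasticsearch _bulk calls

def flush():
    # One Bulk command instead of len(buffer) individual writes.
    bulk_writes.append(list(buffer))
    buffer.clear()

def consume(event):
    # The RabbitMQ consumer pushes each payload onto the list ...
    buffer.append(event)
    # ... and flushes only when the list is full.
    if len(buffer) >= BULK_SIZE:
        flush()

for query in ["milk", "eggs", "flour", "sugar"]:
    consume({"query": query})

print(len(bulk_writes), len(buffer))  # 1 1
```

With Redis itself this would be an `RPUSH` plus an `LLEN` check, and in production you would also flush on a timer so a half-full buffer does not sit around forever.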

Example of messages shared daily via RabbitMQ.

How we cut some costs.

In situations like this, it is very easy to burn your entire tech budget. Keep in mind that some providers, like Google, have a crisis plan ready to use, which avoids hitting your budget entirely.
Of course, you have to serve an unpredicted increase in users, but have you asked yourself whether all of it is necessary?
We did, and we changed our projects, informing customers about our SLA.

Google Maps is the perfect example of what we did.
Supermercato24 and Szopi are not available (at the time of writing) in every city of Italy and Poland, but some customers tried to check whether our service was available for them.
Every time you search for an address on our website, we use Place Autocomplete to offer you the best stores in the area.
We could not serve the customers outside our areas, but they had to hit that API first to discover it, eating into our quotas.
To solve this, we introduced a proprietary endpoint that figures out (with less precision) whether your area is served or not, before entering the standard flow.
This avoids calling Google if that province or voivodeship is not covered.
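The pre-check is conceptually just a cheap lookup in front of the billed API call. A hedged sketch (the area list, function names and response shape are invented, not our real endpoint):

```python
# Coarse coverage list; in reality this would come from our own database.
SERVED_AREAS = {"Verona", "Milano", "Warszawa"}

def is_area_served(province_or_voivodeship):
    # Cheap, less precise check against our own coverage list.
    return province_or_voivodeship in SERVED_AREAS

def call_google_places(address):
    # Placeholder for the real (quota-consuming) Place Autocomplete call.
    return [f"{address}, store nearby"]

def autocomplete(address, area):
    if not is_area_served(area):
        # Skip Google entirely: no quota spent on areas we cannot serve.
        return {"served": False, "suggestions": []}
    return {"served": True, "suggestions": call_google_places(address)}

print(autocomplete("Via Roma 1", "Napoli")["served"])  # False
print(autocomplete("Via Roma 1", "Verona")["served"])  # True
```

The trade-off is precision: the coarse check only knows about provinces, not street-level coverage, but customers outside any served province never trigger a billed request.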

Another example is our CDN-like project.
We were moving images and static assets from our servers to external buckets like S3, with CloudFront in front of them.
We saw the same trend as in the previous paragraph, and we decided to use the most hated and beloved concept of every developer.
The cache.
For every public object in the bucket, we set the right TTL between S3 and CloudFront and between CloudFront and the customers, according to the object's purpose.
You can find everything in the AWS documentation.
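Picking the TTL per object purpose boils down to emitting the right `Cache-Control` header from the origin, which CloudFront honours when deciding how long to keep an object at the edge. The purposes and values below are illustrative, not the ones we actually used:

```python
# Cache-Control TTL per object purpose (illustrative values).
TTL_BY_PURPOSE = {
    "product-image": 86400,     # 1 day: thumbnails change rarely
    "app-bundle":    31536000,  # 1 year: filenames are content-hashed
    "store-logo":    3600,      # 1 hour: may be replaced by partners
}

def cache_control_for(purpose):
    # Conservative default for anything we have not classified yet.
    ttl = TTL_BY_PURPOSE.get(purpose, 300)
    return f"public, max-age={ttl}"

print(cache_control_for("app-bundle"))  # public, max-age=31536000
```

Content-hashed filenames are what make the one-year TTL safe: a new deploy produces new URLs, so stale bundles are never served.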

via Meme Generator

What we learned.

You can never be too prepared for certain circumstances.
You have to adapt very quickly to new challenges: be agile, fail and recover fast.
Nothing is written in stone, everything is mutable (except the cache).
You and your team can and must do it.
Do not be afraid.

It was a stressful time, for everyone.
We tried our best to reduce the friction between shoppers and customers.
We were aware of the social responsibility of our work and we did everything in our power to live up to it.
I would like to thank all the colleagues who worked so hard to adapt our products and flows during this emergency.

Ciao! 🚀🚀🚀

Author: @hex7c0
