As with any decently large Internet business, we do a lot of background processing of various tasks: everything from email rendering to statistics processing to cache flushing and more. Over the last six months we’ve slowly been transitioning many of those tasks to a new, more efficient architecture using RabbitMQ as the queue server and Ruby-based workers to process the actual jobs.
Statistics for processed jobs, including job types, average processing times, time spent in the queue before being processed, and more, are all stored in Redis using the Redistat gem. Redis is a real speed-demon, and Redistat provides an easy way to store and retrieve live statistics. The “live” part does mean writes are heavier than reads, but because Redis is so crazy-fast, this is not a problem. Except if you’re storing a more-than-crazy amount of data, like we are.
Warning lights go off
Thanks to a change two months ago aimed at recording much more detailed statistics for everything we process, the statistics logging for a single job went from making 20-40 hincrby requests to Redis, to an average of 80-160 depending on the type of job. This increase was significant, but it wasn’t a real problem at the time, as our job volumes weren’t crazy. That didn’t last though.
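To make that concrete, here’s a rough sketch of the kind of per-job counter increments involved. The key and field names below are illustrative assumptions, not Redistat’s actual schema; each tuple corresponds to one hincrby call against Redis, and real jobs track far more dimensions than this.

```ruby
# Hypothetical sketch: build the HINCRBY commands for one processed job.
# Key/field naming is an assumption for illustration only.
def stat_increments(job_type, finished_at = Time.now)
  date = finished_at.strftime("%Y-%m-%d")
  hour = finished_at.strftime("%H")
  [
    ["stats:jobs:#{date}",          "processed",              1],
    ["stats:jobs:#{date}",          "processed:#{job_type}",  1],
    ["stats:jobs:#{date}:#{hour}",  "processed",              1],
    ["stats:jobs:#{date}:#{hour}",  "processed:#{job_type}",  1],
  ]
end

# Each tuple maps to one round-trip:
#   redis.hincrby(key, field, by)
```

Track a few dozen dimensions like this per job and the 80-160 requests figure adds up quickly.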
Oh no! Bottlenecks!
Our job volumes are ever-increasing, and around two weeks ago we surpassed 15,000,000 jobs per day. Redis was being hammered constantly with around 15,000-25,000 requests per second, or somewhere north of 1.3 billion per day.
We had two bottlenecks. First and foremost, the network I/O to communicate with Redis was taking about 200-500ms per job. Second, we’d observed Redis itself becoming a bottleneck during peak hours, when we hit around 38,000 requests per second.
Escape from bottleneck-ville (part 1)
To solve the bottleneck issues, we started by separating statistics storage from the actual job processing. Instead of using Redistat to store statistics directly while processing a job, we dumped all the data needed into a JSON string, which we published to a statistics queue in RabbitMQ. That queue was then processed by new statistics workers, whose only task is to write data to Redis.
This effectively doubled the speed at which jobs were processed. It did mean statistics weren’t live anymore, but jobs were being processed fast enough, which was the most important part, and we can now shuffle resources to and from statistics processing as needed.
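The job-worker side of that decoupling can be sketched roughly like this. The payload fields and queue name are assumptions for illustration; the actual publish would go through a RabbitMQ client such as the Bunny gem rather than the comment shown here.

```ruby
require "json"

# Minimal sketch: instead of writing stats to Redis inside the job worker,
# serialize everything a statistics worker needs into a JSON payload.
# Field names are illustrative assumptions, not our actual schema.
def build_stats_payload(job_type, started_at, finished_at)
  {
    "type"        => job_type,
    "duration"    => (finished_at - started_at).round(3),
    "finished_at" => finished_at.to_i,
  }.to_json
end

# In the job worker, roughly (Bunny-style, sketch only):
#   exchange.publish(payload, routing_key: "statistics")
# A statistics worker consumes the queue and performs the Redis writes.
```

The job worker’s Redis round-trips are replaced with a single cheap publish, and the expensive writes happen asynchronously in workers we can scale independently.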
Escape from bottleneck-ville (part 2)
Our escape from Bottleneck-ville wasn’t complete though, because all we had really done was shift an equal amount of work off to a different part of the system. For the first week the statistics queue slowly built up throughout the day, and eventually the workers caught up in the wee hours of the morning when things are slow. We knew it was only a matter of time before the workers would never catch up though.
So the next step was to optimize all those requests made to Redis by Redistat. In effect, Redistat only increments numbers, and if you tell it to increment a specific number 200 times within a second, it will tell Redis the same thing, 200 times.
The solution was to implement a write buffer in Redistat itself. Instead of telling Redis to increment by 1 twenty times, we now tell Redis to increment by 20 once. The buffer increased our statistics processing from around 200 jobs per second to 2,500-3,000 per second, allowing us to clear out the 14 million statistics jobs which had queued up in about two hours.
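The core idea of the write buffer can be sketched in a few lines. This is not Redistat’s actual implementation, just a minimal illustration: increments accumulate in an in-memory hash, and a flush emits one hincrby per unique key/field pair instead of one per increment.

```ruby
# Minimal write-buffer sketch (not Redistat's real code): collapse many
# increments of 1 into a single larger increment per key/field.
class WriteBuffer
  def initialize
    @counts = Hash.new(0)
  end

  # Queue an increment instead of sending it to Redis immediately.
  def incr(key, field, by = 1)
    @counts[[key, field]] += by
  end

  # Flush the buffer, returning one HINCRBY command per unique key/field,
  # e.g. to be sent as redis.hincrby(*cmd).
  def flush
    commands = @counts.map { |(key, field), by| [key, field, by] }
    @counts.clear
    commands
  end
end

buffer = WriteBuffer.new
20.times { buffer.incr("stats:jobs:2011-01-01", "processed") }
buffer.flush  # => [["stats:jobs:2011-01-01", "processed", 20]]
```

Twenty Redis round-trips become one, at the cost of stats lagging by however long the buffer holds data before flushing.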
Scot-free (for now)
We are now processing jobs faster than ever, Redis is seeing less load than it has in months, and pretty much everyone and everything is as happy as a 6-year-old with a Happy Meal. At least until we hit another bottleneck of some kind, which will happen sooner rather than later with our ever-increasing load. ^_^
They say pictures are worth a thousand words, so I'll end this post with a sexy(ish) graph illustrating Redis requests per second over the past week. Can you guess when the write buffer was put into production?
P.S. I will be writing a couple more posts about our overall background job processing architecture in the near future. They'll include juicy details of how we use RabbitMQ, our custom-built Ruby workers, and lots more.