A common issue with third-party APIs is limited availability during peak hours. Even outside peak hours, API requests can be rejected, time out, or fail. In this post I’ll describe a lightweight exponential back-off and retry mechanism for RabbitMQ consumers.
New API, new service
When a user on one of our sites deletes a photo we remove it from our content delivery network (CDN). We have a service-oriented architecture for most of our background tasks and this one is performed by our CDN purge service. The service reacts to events generated by the core web application, sent via a message queue, and calls the CDN API to purge the asset. Recently I updated the service to use the new Akamai CCU REST API.
Since the existing service was a little long in the tooth we decided to rewrite it. As a junior developer it was a great opportunity to work with two excellent tools for background jobs: RabbitMQ and Node.js.
We’re fairly heavy users of RabbitMQ as Paul described in his recent post RabbitMQ: From the Front Line. Whilst many of our consumers are Ruby-based, we’ve found that Node.js is well suited to consumers with little business logic. The event-driven pattern is a good match for the messaging workflow, and lends itself to fast code. We use the node-amqp package as our RabbitMQ client.
Once the new service was up and running it became clear that the CDN API experiences heavy load, typically during the peak hours of 8pm to 2am. We soon discovered that we were hitting the API limit most evenings. The API has an internal queue for purge requests and when it cannot accept any more it responds with 507 queue is full.
Seeing this pattern of successful and failed API requests made it clear that we needed a robust back-off and retry mechanism for Akamai API requests during peak hours.
Our platform publishes several different messages that are routed to the CDN purge queue. The CDN purge service consumes those messages and sends the appropriate API request to the CDN API. This diagram shows the flow of messages from platform to API:
When the API request is rejected we want to try again later. So what should we do with the message in the meantime?
Let RabbitMQ do the work
To implement a back-off and retry mechanism, my first instinct was to create a new wait queue and put failed requests on it to try again later. Since I was new to RabbitMQ this raised several questions:
- Will I need a consumer for messages on the wait queue?
- Can I control how long each message waits before retrying?
- Can I keep track of how many times we’ve tried an API request?
- Can I handle multiple platform events on the same wait queue?
Thankfully, RabbitMQ has a number of protocol extensions that extend the AMQP specification. Two of these features provide all the message handling required for a wait queue: dead letter exchanges and per-message TTL.
Dead letter exchanges (DLX)
The term dead letter mail is still used by postal services to describe what happens to mail that cannot be delivered. In RabbitMQ, messages can be dead-lettered when:
- the message is rejected,
- the message expires, or
- the queue is full.
Similar to how a postal service might return a dead letter to the sender, RabbitMQ will do some work for us and republish a dead-lettered message to the exchange of our choice—the dead letter exchange.
Since we want a wait queue, message expiry will be the most useful trigger for dead-lettering. We’ll look at controlling when messages expire shortly.
Any queue can be configured to dead-letter messages. The dead letter exchange is a queue parameter, set as an argument called x-dead-letter-exchange when you declare the queue. Here is an example using the node-amqp client:
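Here’s a sketch of how that declaration might look (the queue and exchange names, cdn-purge-wait and cdn-purge, are illustrative):

```javascript
// Declare the wait queue with a dead letter exchange. Messages that expire on
// this queue will be republished by RabbitMQ to the 'cdn-purge' exchange.
var waitQueueOptions = {
  durable: true,
  autoDelete: false,
  arguments: {
    'x-dead-letter-exchange': 'cdn-purge' // our primary exchange
  }
};

function declareWaitQueue(connection, done) {
  // `connection` is an open node-amqp connection
  connection.queue('cdn-purge-wait', waitQueueOptions, function (queue) {
    done(queue);
  });
}
```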
Despite the ominous name, dead letter exchanges are normal exchanges with no special configuration. So now we have a wait queue with RabbitMQ configured to dead-letter messages. Next let’s set the expiry for each message, so that RabbitMQ will republish them for us.
More info: RabbitMQ docs on DLX
Per-message TTL
A queue can be declared with a default expiry, or time to live (TTL), for every message. However, to achieve exponential back-off we need to set the expiry of each message individually.
When you publish a message you can set the expiration field in milliseconds:
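For example, with node-amqp the expiration goes in the publish options (a sketch; the routing key and message shape are illustrative):

```javascript
// Publish a message that expires after ttlMs milliseconds. The AMQP
// expiration property is a string, so convert the number before publishing.
function publishWithExpiry(exchange, routingKey, message, ttlMs) {
  exchange.publish(routingKey, message, {
    contentType: 'application/json',
    expiration: String(ttlMs)
  });
}

// e.g. wait 10 seconds before the message is dead-lettered:
// publishWithExpiry(waitExchange, 'photo.delete', { path: '/images/1.jpg' }, 10000);
```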
This is simple enough, but it means that when a purge request fails our consumer has to make a copy of the message in order to publish it with an expiration field. NB: If you declare your queue with message acknowledgement don’t forget to acknowledge the original message!
You’ll need to make sure you copy all of the details of your own messages. Next let’s increase the expiry each time the API request fails.
More info: RabbitMQ docs on per-message TTL
Handling dead-lettered messages
When a message is dead-lettered, RabbitMQ makes a few sensible changes to it and records the details in a header. For our wait queue we’re only interested in what happens to the expiration field: it is removed and recorded as original-expiration in the x-death header. This allows us to find out what the previous expiration was and prevents messages from expiring again. Importantly, the x-death header is an ordered array, so the first record is the most recent.
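Here’s a sketch of a helper that reads the most recent x-death record and computes the next expiry (the initial TTL, multiplier and jitter values are illustrative):

```javascript
// Work out the next per-message TTL from the x-death header:
// start at 10 seconds, multiply by 3 on each retry, add a little jitter.
var INITIAL_TTL = 10000;

function nextExpiration(headers) {
  var xDeath = headers && headers['x-death'];
  if (!xDeath || xDeath.length === 0) return INITIAL_TTL; // first failure
  // The first record in the array is the most recent dead-lettering event
  var previous = parseInt(xDeath[0]['original-expiration'], 10);
  var jitter = Math.floor(Math.random() * 1000); // spread out the retries
  return previous * 3 + jitter;
}
```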
In this example the first expiration is 10,000 milliseconds, which is multiplied by 3 each time the message is retried. It’s common practice to randomise the expiry in exponential back-off algorithms. In our case a sprinkle of randomness increased the chance of successful API requests by spreading out the retries.
Next let’s set up our queues so they can manage multiple platform events.
Routing dead-lettered messages
Our CDN purge service reacts to several platform events, each with its own routing key. The easiest way to handle multiple routing keys is to declare a separate wait exchange.
With a separate wait exchange you can leave the routing keys alone. So when copies of failed messages are published to the wait exchange you don’t have to change the routing key. Just bind your wait queue to the same list of routing keys as your primary queue on the wait exchange.
With this configuration when a message is dead-lettered from the wait queue and republished to your primary exchange the routing keys will stay the same. And it is simple to add or remove routing keys at a later date.
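A sketch of that binding with node-amqp (the routing keys and exchange name are illustrative):

```javascript
// Bind the wait queue to the same routing keys as the primary queue, but on
// the wait exchange, so copies of failed messages keep their routing keys.
var ROUTING_KEYS = ['photo.delete', 'photo.replace'];

function bindWaitQueue(waitQueue) {
  ROUTING_KEYS.forEach(function (key) {
    waitQueue.bind('cdn-purge-wait-exchange', key);
  });
}
```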
All together now
Now let’s bring all the moving parts together for a lightweight exponential back-off and retry mechanism using RabbitMQ:
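Here is a sketch of the whole flow with node-amqp. All queue, exchange and routing-key names are illustrative, and purgeFromCdn(message, callback) stands in for the real CDN API call:

```javascript
// A consumer that retries failed purges with exponential back-off:
// failures are republished to a wait exchange with a per-message TTL, and the
// wait queue dead-letters expired messages back to the primary exchange.
var INITIAL_TTL = 10000;

function nextExpiration(headers) {
  var xDeath = headers && headers['x-death'];
  if (!xDeath || xDeath.length === 0) return INITIAL_TTL;
  var previous = parseInt(xDeath[0]['original-expiration'], 10);
  return previous * 3 + Math.floor(Math.random() * 1000); // back off with jitter
}

function start(connection, purgeFromCdn) {
  connection.exchange('cdn-purge', { type: 'topic' }, function (exchange) {
    connection.exchange('cdn-purge-wait-exchange', { type: 'topic' }, function (waitExchange) {

      // Wait queue: expired messages are dead-lettered back to the primary exchange
      connection.queue('cdn-purge-wait', {
        arguments: { 'x-dead-letter-exchange': 'cdn-purge' }
      }, function (waitQueue) {
        waitQueue.bind('cdn-purge-wait-exchange', 'photo.delete');
      });

      // Primary queue: consume purge requests and call the CDN API
      connection.queue('cdn-purge', { durable: true }, function (queue) {
        queue.bind('cdn-purge', 'photo.delete');
        queue.subscribe({ ack: true }, function (message, headers, deliveryInfo, ack) {
          purgeFromCdn(message, function (err) {
            if (err) {
              // Republish a copy to the wait exchange with a longer expiry
              waitExchange.publish(deliveryInfo.routingKey, message, {
                headers: headers,
                expiration: String(nextExpiration(headers))
              });
            }
            ack.acknowledge(); // always ack the original message
          });
        });
      });
    });
  });
}
```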
I’ve shown how to combine two RabbitMQ extensions—dead letter exchanges and per-message TTL—for a lightweight exponential back-off and retry mechanism. The code examples show how to implement this mechanism in Node.js with the node-amqp client. Here is a diagram to visualise the mechanism:
If you compare this to the first diagram, I hope it will be clear how this mechanism can be used to extend existing consumers that call third-party APIs. In closing, here are brief answers to my opening questions:
Will I need a consumer for messages on the wait queue?
No, let RabbitMQ do the work! Declare a wait queue with an x-dead-letter-exchange argument and RabbitMQ will republish the messages when they expire.
Can I control how long each message waits before retrying?
Yes. But per-message TTL can only be set when you publish a message. So your
consumer has to make a copy of the message manually and publish it with an
expiration field. NB: If you use acknowledgement don’t forget to acknowledge
the original message!
Can I keep track of how many times we’ve tried an API request?
Yes. Every time a message is dead-lettered RabbitMQ records useful details in the x-death header. The first record in the array is the most recent and includes the previous expiration as original-expiration.
Can I handle multiple platform events on the same wait queue?
Yes. The easiest way to manage multiple routing keys is to declare a separate exchange for your wait queue. Then bind your wait queue to the same list of routing keys on the wait exchange.
I hope this helps you handle your API requests and please let me know if you spot any mistakes, or room for improvement.
My next post describes how we used StatsD and Graphite to monitor API requests and optimise our exponential back-off and retry mechanism.