Back-off and retry with RabbitMQ

By: Josh Hill

Tags:

  • api
  • node.js
  • rabbitmq

A common issue with third-party APIs is limited availability during peak hours. Even outside peak hours, API requests can be rejected, time out, or fail. In this post I’ll describe a lightweight exponential back-off and retry mechanism for RabbitMQ consumers.


New API, new service

When a user on one of our sites deletes a photo we remove it from our content delivery network (CDN). We have a service-oriented architecture for most of our background tasks and this one is performed by our CDN purge service. The service reacts to events generated by the core web application, sent via a message queue, and calls the CDN API to purge the asset. Recently I updated the service to use the new Akamai CCU REST API.

Since the existing service was a little long in the tooth we decided to rewrite it. As a junior developer it was a great opportunity to work with two excellent tools for background jobs: RabbitMQ and Node.js.

We’re fairly heavy users of RabbitMQ, as Paul described in his recent post RabbitMQ: From the Front Line. Whilst many of our consumers are Ruby-based, we’ve found that Node.js is well suited to consumers with little business logic. The event-driven pattern is a good match for the messaging workflow, and lends itself to fast code. We use the node-amqp package as our RabbitMQ client.

Once the new service was up and running it became clear that the CDN API experiences heavy loads, typically during the peak hours of 8pm to 2am. We soon discovered that we were hitting the API limit most evenings. The API has an internal queue for purge requests and when it cannot accept any more it responds with 507 queue is full.

CDN response pattern over one week

Seeing this pattern of successful and failed API requests made it clear that we needed a robust back-off and retry mechanism for Akamai API requests during peak hours.

Starting point

Our platform publishes several different messages that are routed to the CDN purge queue. The CDN purge service consumes those messages and sends the appropriate API request to the CDN API. This diagram shows the flow of messages from platform to API:

Diagram of our CDN purge process

When the API request is rejected we want to try again later. So what should we do with the message in the meantime?

Let RabbitMQ do the work

To implement a back-off and retry mechanism, my first instinct was to create a new wait queue and put failed requests on it to try again later. Since I was new to RabbitMQ this raised several questions:

  • Will I need a consumer for messages on the wait queue?
  • Can I control how long each message waits before retrying?
  • Can I keep track of how many times we’ve tried an API request?
  • Can I handle multiple platform events on the same wait queue?

Thankfully, RabbitMQ supports a number of protocol extensions to the AMQP specification. Two of these provide all the message handling required for a wait queue: dead letter exchanges and per-message TTL.

Dead letter exchanges (DLX)

The term dead letter mail is still used by postal services to describe what happens to mail that cannot be delivered. In RabbitMQ, messages can be dead-lettered when:

  • the message is rejected,
  • the message expires, or
  • the queue is full.

Similar to how a postal service might return a dead letter to the sender, RabbitMQ will do some work for us and republish a dead-lettered message to the exchange of our choice—the dead letter exchange.

Since we want a wait queue, message expiry is the most useful trigger for dead-lettering. We’ll look at controlling when messages expire shortly.

Any queue can be configured to dead-letter messages: the dead letter exchange is set as a queue argument called x-dead-letter-exchange when you declare the queue. Here is an example using the node-amqp client:

// Dead-letter expired messages from this queue to the exchange named "exchange"
var queueOptions = { arguments: { "x-dead-letter-exchange": "exchange" } };

connection.queue("wait-queue", queueOptions, function(waitQueue) {
  // Bind to exchange
});

Despite the ominous name, dead letter exchanges are normal exchanges with no special configuration. So now we have a wait queue with RabbitMQ configured to dead-letter messages. Next let’s set the expiry for each message, so that RabbitMQ will republish them for us.

More info: RabbitMQ docs on DLX

Per-message TTL

A queue can be declared with a default expiry, or time to live (TTL), for every message. However, to achieve exponential back-off we need to set the expiry of each message individually.
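For comparison, the queue-wide default is set with the x-message-ttl argument when the queue is declared (the 60-second value here is just an example). We can’t use this for back-off, because every message on the queue would share the same expiry:

```javascript
// Queue-wide default: every message on this queue expires after
// 60 seconds unless it carries its own shorter per-message expiration.
var queueOptions = { arguments: { "x-message-ttl": 60000 } };
```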

When you publish a message you can set the expiration field in milliseconds:

var messageOptions = { expiration: 10000 };

exchange.publish("routing-key", "body", messageOptions);

This is simple enough, but it means that when a purge request fails, our consumer has to make a copy of the message in order to publish it with an expiration field. NB: If you declare your queue with message acknowledgement, don’t forget to acknowledge the original message!

var subscribeOptions = { ack: true };

queue.subscribe(subscribeOptions, function(message, headers, deliveryInfo, messageObject) {
  // Post request to API
  // ...

  // If the API request fails
  var messageOptions = {
    appId: messageObject.appId,
    timestamp: messageObject.timestamp,
    contentType: messageObject.contentType,
    deliveryMode: messageObject.deliveryMode,
    headers: headers,
    expiration: 10000
  };
  exchange.publish(deliveryInfo.routingKey, message, messageOptions);
  messageObject.acknowledge(false);
});

You’ll need to make sure you copy all of the details of your own messages. Next let’s increase the expiry each time the API request fails.
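To keep that copying in one place, the republish options can be built by a small helper. This is just a sketch covering the fields used in the example above; extend it with whatever other properties your own messages carry:

```javascript
// Build the publish options for the retry copy of a message, preserving
// the original properties and setting a fresh expiration in milliseconds.
function retryOptions(messageObject, headers, expiration) {
  return {
    appId: messageObject.appId,
    timestamp: messageObject.timestamp,
    contentType: messageObject.contentType,
    deliveryMode: messageObject.deliveryMode,
    headers: headers,
    expiration: expiration
  };
}
```

The failure branch then shrinks to a single publish call: exchange.publish(deliveryInfo.routingKey, message, retryOptions(messageObject, headers, 10000)).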

More info: RabbitMQ docs on per-message TTL

Handling dead-lettered messages

When a message is dead-lettered, RabbitMQ makes a few sensible changes to it and records the details in a header. For our wait queue we’re only interested in what happens to the expiration field.

The expiration field is removed and recorded as original-expiration in the message’s x-death header. This allows us to find out what the previous expiration was and prevents messages from expiring again. Importantly, the x-death header is an ordered array, so the first record is the most recent.

var expiration;

if (headers["x-death"]) {
  // original-expiration is recorded as a string, so parse it first
  expiration = parseInt(headers["x-death"][0]["original-expiration"], 10) * 3;
} else {
  expiration = 10000;
}
// Apply some randomness to the expiration
// ...

In this example the first expiration is 10,000 milliseconds, which is multiplied by 3 each time the message is retried. It’s common practice to randomise the expiry in exponential back-off algorithms. In our case a sprinkle of randomness increased the chance of successful API requests by spreading out the retries.
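The randomness placeholder in the snippet above can be filled in with simple multiplicative jitter. The function and the 0.5–1.5 range here are illustrative choices, not the exact values from our service:

```javascript
// Scale an expiration by a random factor between 0.5 and 1.5, so that
// messages which failed at the same moment do not all retry in lockstep.
function jitter(expiration) {
  return Math.round(expiration * (0.5 + Math.random()));
}

// For example, the second retry (10000 * 3 = 30000 before jitter):
var expiration = jitter(30000); // between 15000 and 45000
```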

Next let’s set up our queues so they can manage multiple platform events.

Routing dead-lettered messages

Our CDN purge service reacts to several platform events, each with its own routing key. The easiest way to handle multiple routing keys is to declare a separate wait exchange.

With a separate wait exchange you can leave the routing keys alone. So when copies of failed messages are published to the wait exchange you don’t have to change the routing key. Just bind your wait queue to the same list of routing keys as your primary queue on the wait exchange.

var routingKeys = ["routing-key-a", "routing-key-b"];

connection.exchange("wait-exchange", waitExchangeOptions, function(waitExchange) {
  var waitQueueOptions = { arguments: { "x-dead-letter-exchange": "exchange" } };

  connection.queue("wait-queue", waitQueueOptions, function(waitQueue) {
    // Bind wait queue to all routing keys on wait exchange
    routingKeys.forEach(function(routingKey) {
      waitQueue.bind("wait-exchange", routingKey);
    });
  });
});

connection.exchange("primary-exchange", primaryExchangeOptions, function(primaryExchange) {
  connection.queue("primary-queue", primaryQueueOptions, function(primaryQueue) {
    // Bind primary queue to all routing keys on primary exchange
    routingKeys.forEach(function(routingKey) {
      primaryQueue.bind("primary-exchange", routingKey);
    });
    // Subscribe to messages
    // ...
  });
});

With this configuration, when a message is dead-lettered from the wait queue and republished to your primary exchange, the routing key stays the same. And it is simple to add or remove routing keys at a later date.

All together now

Now let’s bring all the moving parts together for a lightweight exponential back-off and retry mechanism using RabbitMQ:

var amqp = require("amqp");

var connection = amqp.createConnection({ host: "localhost" });

connection.on("ready", function() {
  var waitExchange,
      exchangeOptions = { type: "direct" },
      routingKeys = ["routing-key-a", "routing-key-b"];

  waitExchange = connection.exchange("wait-exchange", exchangeOptions, function(waitExchange) {
    // Dead-letter expired messages back to the primary exchange
    var waitQueueOptions = { arguments: { "x-dead-letter-exchange": "primary-exchange" } };

    connection.queue("wait-queue", waitQueueOptions, function(waitQueue) {
      routingKeys.forEach(function(routingKey) {
        waitQueue.bind("wait-exchange", routingKey);
      });
    });
  });

  connection.exchange("primary-exchange", exchangeOptions, function(primaryExchange) {
    connection.queue("primary-queue", {}, function(primaryQueue) {
      var subscribeOptions = { ack: true };

      routingKeys.forEach(function(routingKey) {
        primaryQueue.bind("primary-exchange", routingKey);
      });

      primaryQueue.subscribe(subscribeOptions, function(message, headers, deliveryInfo, messageObject) {
        var expiration, messageOptions;
        // Post request to API
        // ...

        // If the API request fails
        if (headers["x-death"]) {
          // original-expiration is recorded as a string, so parse it first
          expiration = parseInt(headers["x-death"][0]["original-expiration"], 10) * 3;
        } else {
          expiration = 10000;
        }
        messageOptions = {
          appId: messageObject.appId,
          timestamp: messageObject.timestamp,
          contentType: messageObject.contentType,
          deliveryMode: messageObject.deliveryMode,
          headers: headers,
          expiration: expiration
        };
        waitExchange.publish(deliveryInfo.routingKey, message, messageOptions);
        messageObject.acknowledge(false);
      });
    });
  });
});

Summary

I’ve shown how to combine two RabbitMQ extensions—dead letter exchanges and per-message TTL—for a lightweight exponential back-off and retry mechanism. The code examples show how to implement this mechanism in Node.js with the node-amqp client. Here is a diagram to visualise the mechanism:

Diagram of back-off and retry mechanism

If you compare this to the first diagram, I hope it will be clear how this mechanism can be used to extend existing consumers that call third-party APIs. In closing, here are brief answers to my opening questions:

Will I need a consumer for messages on the wait queue?

No, let RabbitMQ do the work! Declare a wait queue with an x-dead-letter-exchange argument and RabbitMQ will republish the messages when they expire.

Can I control how long each message waits before retrying?

Yes, but per-message TTL can only be set when you publish a message. So your consumer has to make a copy of the message and publish it with an expiration field. NB: If you use acknowledgement, don’t forget to acknowledge the original message!

Can I keep track of how many times we’ve tried an API request?

Yes. Every time a message is dead-lettered RabbitMQ records useful details in the message’s x-death header. The first record in the array is the most recent and includes the original-expiration.

Can I handle multiple platform events on the same wait queue?

Yes. The easiest way to manage multiple routing keys is to declare a separate exchange for your wait queue. Then bind your wait queue to the same list of routing keys on the wait exchange.

I hope this helps you handle your API requests. Please let me know if you spot any mistakes or room for improvement.


Stay tuned

My next post describes how we used StatsD and Graphite to monitor API requests and optimise our exponential back-off and retry mechanism.

Follow @globavdev to be notified of new posts, and check out our jobs page; we’re hiring.


About the Author

Josh Hill