Bench testing a new subsystem

By: Michael Mazour

Tags:

  • ruby
  • amqp
  • rabbitmq
  • services
  • testing
  • stress testing
  • celluloid
  • api
  • scaling

Our platform consists of a lot of interoperating systems and services, and we’re adding more all the time. In this kind of environment, development and in situ integration testing of new components become tricky as the layers of functionality both upstream and downstream from each new piece grow more numerous and complex. To contain this complexity while building one new component, we built some dedicated apps just to simulate the runtime environment that the new component sees.

This approach of building a system-simulating test bench along with a new subsystem has worked out really well, so I’d like to tell you about it, and why it’s different from the unit and integration testing that we do all the time.

The problem

We’re building a subsystem that watches an AMQP stream generated by various activities on our website. When it sees an AMQP message of interest, it issues a service call or three, gets some results back, and stores them. Simple, right?
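
In code, the core loop really is about that simple. Here’s a minimal sketch, assuming the bunny gem for the AMQP side; the queue name, message fields, and the service and storage calls are placeholders for illustration, not our real ones:

    require "bunny"
    require "json"

    # Hypothetical names throughout -- the queue, message fields, service client,
    # and result store are placeholders for illustration.
    connection = Bunny.new
    connection.start

    channel = connection.create_channel
    queue   = channel.queue("site.activity", durable: true)

    queue.subscribe(manual_ack: true, block: true) do |delivery, _props, payload|
      message = JSON.parse(payload)

      case message["type"]
      when "signup", "purchase"                           # each message type gets its own handling
        result = SomeServiceClient.lookup(message["id"])  # one or more service calls
        ResultStore.save(message["id"], result)           # store what came back
      end

      channel.ack(delivery.delivery_tag)
    end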

Sure, except for one or two little things…

  • There are a variety of different message types we need to watch for, and each needs its own handling.
  • Some types are quite infrequent (but still very important).
  • Some of them are new and aren’t in production yet.
  • Service responses and response times can vary, and the services might also return errors.
  • One of the services is out-of-house and may be subject to extra latency.
  • During development we have to use the out-of-house service in a ‘test mode’ that’s functionally incomplete.
  • Production volumes are high at peak hours, so we need serious stress tests.

Ouch. You can see that it’s not going to be easy to test our new subsystem against all the worrisome cases, including the rare and special ones, under load and with realistic return values, unless we can make both the inputs and the services behave the way we want them to, when we want them to.

Unit and integration tests could give us some of that, but they don’t simulate load or latency very well, and we need to be sure we can sustain throughput under stressful conditions.

The test bench

So what we needed was a test bench we could set up to deliver inputs and services to our subsystem in a controllable way, letting us repeatably simulate a variety of workloads and system conditions.

We already had most of the pieces; only two were missing:

  • A load generator

    The load generator sends a stream of AMQP messages, simulating what would be generated by activity on our platform. But unlike the real world, the generator sends whatever load we want, from a trickle to a torrent, and the mix of message types is fully controllable so that we never have to wait for one of the “rare” message types to come along. (There’s a sketch of a generator like this just after this list.)

  • A simulator for the out-of-house service

    While the behaviour and performance of our in-house services are well understood, the out-of-house service that our new system talks to is a question mark. It’s outside our infrastructure, so latency will be subject to the Internet at large. And for the moment we’re restricted to a testing mode that doesn’t return the full range of possible responses. Our simulator lets us exercise the full range of result cases we should see in live mode, and introduce suitably pessimistic latencies and error rates so we can be sure we handle them gracefully. (A sketch of such a simulator appears a little further down.)

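To give a flavour of the load generator, here’s a minimal sketch, again assuming bunny; the exchange name, message types, rate, and weights are invented for illustration. The point is that both the rate and the mix are plain configuration we can change at will:

    require "bunny"
    require "json"
    require "securerandom"

    # Invented knobs -- in the real generator these come from configuration.
    RATE_PER_SEC = 200
    MIX = { "page_view" => 80, "purchase" => 15, "rare_event" => 5 }   # weighted message mix

    connection = Bunny.new
    connection.start
    exchange = connection.create_channel.topic("site.activity")

    weighted = MIX.flat_map { |type, weight| [type] * weight }

    loop do
      type = weighted.sample
      exchange.publish({ type: type, id: SecureRandom.uuid }.to_json,
                       routing_key: "activity.#{type}")
      sleep(1.0 / RATE_PER_SEC)
    end
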
With these plus one of our standard testing boxes, we have a full simulated production environment for our new subsystem, with a controllable load and a controllable simulated out-of-house service (along with a controllably dodgy connection to that service).
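
For the out-of-house-service simulator, something along these lines does the job. This is a Sinatra sketch, and the endpoint, response values, latency range, and error rate are all assumptions, chosen to be pessimistic rather than to match the real service:

    require "sinatra"
    require "json"

    # Invented, deliberately pessimistic defaults -- tune per test run.
    LATENCY_RANGE = (0.2..3.0)                      # seconds of simulated round-trip delay
    ERROR_RATE    = 0.05                            # fraction of requests that fail outright
    RESULTS       = %w[approved declined pending]   # hypothetical live-mode result cases

    post "/lookups" do
      sleep rand(LATENCY_RANGE)                     # inject latency

      halt 503, { error: "simulated outage" }.to_json if rand < ERROR_RATE

      content_type :json
      { status: RESULTS.sample }.to_json            # cover the full range of responses
    end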

The benefits

Our test bench lets us:

  • Check correctness. We already had unit and integration tests, but the simulator caught some problems those missed.
  • Push the system arbitrarily hard and find its limits (and what happens when we exceed them).
  • Locate and fix bottlenecks.
  • Iterate.

Fast iteration over our scaling problems has been hugely valuable. Current iterations of our work are far tighter than the first correctly-running ones. They run faster, they degrade more gracefully, they’re better at recovering when things go wrong, and they log more diagnostically useful stuff.

And we’ve been able to preflight a variety of deployment scenarios. MRI vs. JRuby. Number of in-process workers versus number of processes. Balance between different parts of the subsystem. The demands it puts on the rest of our infrastructure when it’s under load. The kind of Ops stuff it’s nice to know before the application actually hits Ops.
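
As one example of those knobs: if the in-process workers are Celluloid actors (as this post’s tags suggest, though the class and numbers below are made up), the worker count is a single pool-size setting, which makes those scenarios cheap to try:

    require "celluloid"

    # Hypothetical worker actor -- the real message handling lives elsewhere.
    class MessageWorker
      include Celluloid

      def process(message)
        # service calls, storage, logging, etc.
      end
    end

    # One process, N in-process workers; vary the size (or run more processes)
    # to explore the trade-offs mentioned above.
    pool = MessageWorker.pool(size: 8)
    # pool.async.process(message)   # hand each incoming message to the pool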

Lastly, it’s allowed us to test our system’s dependency on an outside service in a more thorough way than we could have done by using that service’s test mode alone.

This way of working has definitely improved the quality of what we’ll have ready on launch day. I’m confident we’ll be using this approach a lot in the future.

