Yesterday we released a new feature on the platform called "New Members Today" — those members that are interested will receive a daily email containing profiles of new members that may be of interest to them. We've been running this across a sub-section of the member-base, but expanded this to all members.
As with anything, we spent time testing our changes as fully as we could in both development and staging environments. We needed this to be right, as we'd soon be sending emails out to millions of members. Unit tests passed. Integration testing went well. Full-stack tests sent the correct content to the right members. We were good to go: the button was pressed and everything was deployed to production.
At midday, the scheduler kicked things off, and we were in business: items started appearing on the queues, metrics started increasing as messages were spooled, and emails started arriving as our seed addresses. High-fives all round, and it was off to lunch. Then the not-so-good thing happened...
After about an hour, we started getting reports of emails being sent with incorrect content. The example we had had been sent to a male member based in the UK and, rather than containing women from the UK, showed a nice grid of men from South Africa. Not exactly what we'd planned. Sending was stopped immediately and, confused, we started digging in to what the problem might be.
First pass over the code came up a blank; we could not see anything that could have gone wrong. The only thing we did find had such a tiny probability of occurring that we rejected it as impossible. So we started a more systematic process of digging through various codepaths and, one by one, we eliminated options. Once again, the only thing we were left with was the same "impossible, it could never have happened" issue, but this time we felt our definition of "impossible" was drifting much more towards the "unlikely, but possible" side.
How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?— Sherlock Holmes
The now-probable issue was down to how we were generating and caching the content for the emails. Because we send the same content to a number of people, we generate it up front and cache it in Redis (we use Redis over memcached in this case purely because we already have a connection to Redis for stats collection.) Because of the way the cache keys were named, it was possible, but still highly unlikely, that we could get key collisions so one piece of content would overwrite another. For this to happen, two jobs jobs would have to run within the same second, and generate exactly the same 3-digit random number which got appended to the cache key.
Because we'd stopped processing half-way through, we still had unprocessed jobs on the queue. We started pulling these off one by one and inspecting them, and those types of messages which had been rendering incorrectly were indeed pointing to a content cache key that contained the wrong data. Then, due to our stats and logging, we could see that Redis contained one less content cache key than there had been jobs run to generate content. By this point we were convinced that the key-collision issue was the problem, however unlikely it seemed.
We could see that these generation jobs were taking around 0.6 seconds; because we were running multiple workers in parallel, it meant that it was more likely than not for two (or more) jobs to run within the same second. This took the likelihood of a key collision down to the 1-in-1000 based on the 3-digit random key. We may never have noticed this problem: the keys could never have collided and the system would run correctly for weeks or months without anything going awry. It just so happened that this 1-in-1000 chance hit home the first time we ran it.
After kicking ourselves for creating the circumstances where it was possible for this bug to occur, we fixed it within a couple of minutes (removing the randomness from the key generation) and added the extra ability to run within the live environment with a whitelist of recipients. This gave us the ability to run the whole stack in production, but discard anything other than those messages we wanted to see, which meant we could run it in a live environment and be comfortable saying "it works."
There are a number of lessons we've learned from this experience.
- Don't copy or reuse code from another part of the system without thinking about the differences between the two systems. We took the key generation code from another project which would never run concurrently, but didn't consider the differences between how the code would be called in its new environment.
- Testing is not a replacement for code reviews. Although we spent a lot of time on unit tests and then performing functional integration tests, we didn't invest enough time on peer review, which may have caught the possibility of this issue before it occurred.
- Our testing environment wasn't an accurate enough representation of live to allow us to test properly. The particular job that created the conflicting keys took nearly 20 seconds to run in the staging environment, but less than a second in production, meaning that the chances of seeing this bug were much higher in production than in staging.
- Logging can be your proof. We didn't have any logging in place to show what cache keys were being generated. If we did, we could have very easily shown than one was overwriting another, avoiding the time-consuming process of getting to the "this is the only thing left" point, saving ourselves hours of time.
For our business — in fact, in any business — sending hundreds of thousands of incorrect messages to member is a huge deal: members lose confidence in the platform which can increase cancellations and support requests; partners lose confidence in our abilities to run the platform and may reduce their marketing spending until they're happier.
No bugs are "acceptable," but a bug that affects more than a few people can be far-reaching and can impact the business in subtle ways that are difficult to track, which is why we place a great deal of importance on getting things right, first time.