Sometimes things go wrong. The results of the "something" going wrong are usually fairly obvious — searches not returning, systems not starting, payments not processing — but working out why that "something" stopped working (or started working differently) is the challenge.
Earlier this week we noticed that for the previous week or so, we'd been making a huge number more lookups to an external system for retrieving the latitude and longitude for members' postcodes (which we use for distance calculations between members,) but we couldn't place our finger on what had happened: no one had changed the lookup code in months. We permanently cache all data locally to avoid excessive remote calls, and that data also remained intact.
Looking at what we had changed, we had a hunch that the symptom was caused by our new member profile pages: it was the only significant release during the relevant period. But that was a hunch, a guess. Until you know what the problem is, you can't fix it. So we quickly threw up some additional logging around calls to the external service to track where they were coming from. We found the lines of code: it was the new profile page. It turned out that the new work had not created a new bug, but by changing the behaviour, had uncovered an already-existing problem that hadn't been big enough for us to notice before.
When we were doing a distance calculation between two members, if we didn't have lat/long data for a member's postcode, then we made a call to the external service retrieve it. But what if the external service didn't recognise the postcode either? The PAF may be "the most up-to-date and complete address database in the UK" but it still doesn't have everything: new housing estates are going up all the time. Our answer: we don't record anything, and return an answer of "we can't tell how far apart you live." Seems reasonable.
It is, until you realise that the new profile page runs up to four times as many distance calculations as the old one (we now provide a list of "similar" members at the bottom of each page.) So if the member looking at a profile doesn't have the correct location data: yes, we looked up (and failed to retrieve) this data four times in a single page view.
Once we'd found the issue, it was a simple enough matter to deal with it: we now only check for location data when an unrecognised postcode is entered by a member. You can see from the chart to the left that once we'd identified and fixed the problem, lookups via the external service dropped immediately: not just back to previous levels, but to almost nothing. Can you spot what time we deployed the fix?
Expect the unexpected
Not only should you expect it, but you should know your system well enough to deal with it. It goes without saying that you should be using some form of source control for all your code, but you should also we clearly tracking what's deployed and when, and communicating that to everyone within the team. Regular code reviews throughout the whole team are important too: not just senior developers pre-commit reviewing a junior's work, but those junior developers reviewing the senior's work to increase their understanding of the platform they're working on.
You can build the biggest test suite you can around your application, but even if you aim for 100% coverage in unit tests, integration between components can throw up things you just weren't looking for. In the example above the code was doing exactly what it was supposed to do; testing for requirements errors, especially when those requirements change gradually over time (or with growth), is near on impossible. Something that seemed reasonable when running a single site with a handful of users suddenly becomes problematic when you're running thousands of sites with millions of members.
Measure, monitor and log everything
Any monitoring you can do is worth its weight in gold when something goes askew. And when you're not monitoring or logging something that has gone wrong, the ability to quickly start logging is of paramount importance. If you notice a problem but can't make any production changes until the next scheduled release in two weeks time, you've got problems. Etsy's StatsD is a great example of how they've created something which allows their developers to start collecting data within a matter of minutes.
We've recently been putting a lot of effort into measuring and monitoring everything that goes through the WhiteLabelDating platform to improve our visibility of what's going on within the running system, and get earlier warning signs of something going wrong. As well as pre-existing systems such as Munin and Nagios, we're also building on new components where necessary, like Redistat, but that's a tale for another post...