We recently implemented a new feature on the White Label Dating platform that allows members to see which other members are located near to them.
Part of the requirements for the feature were that members should be clustered together and displayed as a bubble which indicates the number of members who live in that cluster.
In the process of developing the new feature we tried many different method of clustering members to see which worked and gave a natural realistic feel to the site.
One clustering method that you hear about quite a bit is K-means. K-means partitions a population into clusters, such that the members of a cluster are (loosely yet mathematically speaking) "close" to other members of the same cluster, and "far" from members of other clusters. And it's all automatic. Sounds perfect!
Well, except for a couple of things. First, the "K" in K-means is a constant, a constant which you must choose. In other words, you have to know in advance how many clusters you want to end up with. Now, you could run K-means repeatedly, over a wide range of K's, and apply certain measurements to the results to decide which K is the "optimum" K. But that's more than a little slow. Also, you have to choose the starting locations for your K clusters, and the quality of the results can vary quite a bit depending on which starting points you choose. Arbitrarily setting the starting points to a grid across the UK resulted in disappointing results. With a very unrealistic grid like display.
In other words, while K-means is great for some purposes, it's not really automatic enough for our purposes. And it doesn't run quickly either, not when you have a both a large population and a large number of clusters.
Other clustering algorithms
So we surveyed the landscape of other clustering algorithms. There are quite a few. Our criteria were:
- Support multiple zoom levels (so, for example, London would be one big cluster at one zoom level, and break into more clusters as you zoomed in)
- Fairly automated, not a lot of settings to twiddle
- But also configurable enough that we could implement business rules like "at zoom level X, choose clusters so their populations wind up being between Y and Z"
- Produce cluster locations on the map that intuitively "look right" to a person familiar with the area
- Fast enough, against our millions of members, to give us reasonably rapid feedback during our design and tuning process
- Readily available and customisable implementation
We didn't find a suitable candidate that ticked all the boxes, so we designed our own.
A home grown clustering algorithm
When we sat down to build our own clustering algorithm, we kept all of the above in mind and set out to write something that would meet those needs, without necessarily meeting any others. We were not going for perfect generality, or perfect behaviour under a mathematically-defined set of conditions; we were trying to build something that would split the population into nicely-sized, nicely-located clusters, under a set of business rules, and do it very quickly so that we could work with the design team to tweak and refine the system as needed.
So was born the BSPAD algorithm. (All the cool algorithms have obscure-yet-scientific sounding names, so ours does too. Never mind the meaning, I'll just say it's reasonably descriptive.) BSPAD is at heart a recursive subdivision algorithm with customisable logic for determining each partition is "right" and when the algorithm should keep working on it. It's also an inherently fast algorithm, and our implementation included optimisations that made it quite a bit faster still.
For all that, though, the clusters didn't really look natural. They looked a little bit mechanical, a little bit rule-based (which is what they were). Not like something a human familiar with the neighbourhoods would choose. Some of the problem was down to the fact that the algorithm was very young. Some was down to the rules on which the clusters were based being themselves arbitrary. Solvable problems, but while we were working them out we got distracted by something that had been staring at us all along.
During the early phases of our investigation we had coded up a visualisation based on UK postcodes. It was helpful for us to see where our members were on the map, and it provided some insight into, for example, suitable starting locations for K-means clusters. We didn't plan on using it for our actual clusters (it got awfully crowded in urban areas like London for example), but the more we looked at it, the more we realised how good it was: it naturally followed the population, naturally followed city borders and things like that, and as a result just looked "more right" than our other methods did. And it was ready to be used, with very little extra work required.
Here in the UK postcodes are alphanumeric and between 5 and 8 characters long and usually presented as two groups separated by a space. The first group consists of one or two letters which indicates the postal district and the following letters and numbers indicate the postal town. So for example RG would indicate the Reading area and RG1 would be the centre of Reading itself. Other parts of the town have codes that begin with RG2, RG4, RG5, RG6 and RG30.
Smaller nearby towns in the Reading area have the same two letter prefix but a different number so for example RG21, RG22, RG23, RG24 are different parts of Basingstoke.
The second group of characters begin with a number which indicates particular parts of the town or delivery area and the final two letters denoting a property or group of properties within that area.
We soon realised that in the UK we could use this information to plot useful and realistic maps of our members, by using clusters based on the partial postcode leaving out the last two characters. The resulting maps looked realistic and did not reveal exact locations which might compromise member confidentiality.
The release of our work on UK based WLD web sites was very well received and resulted in an increase in full membership applications. We were keen to support the new feature in other countries. Unfortunately postcodes in other countries are not so useful for drawing realistic maps of members.
In Australia postcode consist of a four digit number with the first digit indicating the state. A single code can indicate a number of districts and nearby towns. For example 4870, the code for Cairns in Queensland, also covers the nearby districts of Aeroglen, Brinsmead, Earlville and several others.
In addition Australian postcodes don't always cover a contiguous area and some postcodes cover very large areas. One example is the code 4871 which covers a huge area around northern Queensland more than 1000km from one side to another. It includes Mornington Island near the Northern Territory border and the eastern coastline south of Cairns.
Since the we plot the location of the cluster at a position near the geographical centre of the postcode they may sometimes be slightly different from the centre of population.
The issues highlighted by Australian postcodes are likely in other countries so for the time being we have not released the maps feature in other countries. We are working toward a way to collect better data and hope to do that and release maps for other countries as soon as we have the data available to support them.
Thank you to Michael Mazour for his assitance in writing parts of this article.