Sphinx (part 1)

By: Ian Winter

Tags:

  • data-storage
  • indexing
  • mysql
  • search
  • sphinx

As part of our continued growth, one of the main problems we came across was how to keep our searches running as quickly as possible so our members could find their perfect match easily. As the application grew, some of the search queries running against MySQL were becoming complex and taking longer to complete than we'd like.

In the first instance a few simple solutions were put in place. All the data that needed to be searchable was de-normalised into a single table with a few targeted indexes, and a background task kept this data "flat" and up to date. This was taken a step further by splitting those de-normalised tables by geographic region and network type. That helped, but added an overhead to the regular process of updating the tables.

There are a few solutions out there that would work as search engines: MySQL queries, Solr and Sphinx, to name just three. We outgrew MySQL queries so had to look at the other alternatives. Solr is a fast, open source search platform based on the Apache Lucene project. Sphinx is a full-text search engine specifically designed to integrate well with SQL databases and scripting languages.

It took a week or so to run some feasibility tests to make sure both that it did what we wanted and that we could hook it up to our application, which runs on ColdFusion. It became apparent quite quickly that Sphinx was going to be a lot faster to integrate with ColdFusion: compile the jar file, drop it in the right place, invoke the object and you're off!
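As a rough sketch, that hookup might look something like the following, assuming the Java client (sphinxapi.jar) shipped with the Sphinx distribution has been compiled and placed on ColdFusion's classpath; the index name "members" and the port are illustrative (check your searchd config for the actual listen port):

```cfm
<cfscript>
// Instantiate the Java client class bundled with Sphinx.
sphinx = createObject("java", "org.sphx.api.SphinxClient");
sphinx.SetServer("localhost", 9312); // searchd host and port
sphinx.SetLimits(0, 20);             // offset and number of matches to return

// Query a hypothetical "members" index for a search term.
result = sphinx.Query("perfect match", "members");
writeOutput("Found " & result.totalFound & " matches");
</cfscript>
```

From there the matched document ids in the result can be joined back to MySQL to pull the full member records.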

At its most basic Sphinx has three parts: a source, which runs an SQL statement and maps the query columns to fields and attributes; an index, which specifies the file storage location, minimum word length, matching rules and so on; and lastly the searchd process, which serves the index and does the searching work. The index gets built and, once the searchd process is up, you can search. Indexes can be rebuilt whilst searches are still running by passing the --rotate option to the indexer, which HUPs the searchd process and makes your new index available right away.
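Put together, those three parts might look something like this in sphinx.conf; the table, column and path names here are purely illustrative:

```ini
source members_src
{
    type      = mysql
    sql_host  = localhost
    sql_user  = sphinx
    sql_pass  = secret
    sql_db    = app

    # The query whose columns become the document id, full-text fields
    # and filterable/sortable attributes.
    sql_query = SELECT id, username, profile_text, region_id, updated_at FROM members_flat
    sql_attr_uint      = region_id
    sql_attr_timestamp = updated_at
}

index members
{
    source       = members_src
    path         = /var/data/sphinx/members   # where the index files are stored
    min_word_len = 3                          # ignore very short words
}

searchd
{
    listen   = 9312
    log      = /var/log/sphinx/searchd.log
    pid_file = /var/run/sphinx/searchd.pid
}
```

You then build the index with `indexer members`, and once searchd is running, rebuild without downtime via `indexer --rotate members`, which swaps the new index in and HUPs searchd.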

When you have member data that changes constantly you don't want to rebuild the whole index all the time; it would take a while and is unnecessary. Luckily Sphinx supports a main + delta scheme, as well as live updates.

Basically you build your base index daily (for example), storing a timestamp of when that occurred, then every 5 minutes build a delta index containing the changes since. When it comes to searching you specify an index; to search both the base and the delta you just tell Sphinx to search both and it'll sort it out for you. To get rid of anything you don't want, like members who have been removed as scammers, there is a handy kill-list option which removes them at build time of the delta. There are other ways to search indexes in a distributed way, but more on that next time.
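A minimal sketch of the main + delta scheme, extending the hypothetical config above; the `sph_counter` table for tracking the last build timestamp and the `members_removed` table are illustrative conventions, not part of Sphinx itself:

```ini
# Delta source: inherits connection details from the main source and only
# picks up rows changed since the last main index build.
source members_delta_src : members_src
{
    sql_query = SELECT id, username, profile_text, region_id, updated_at \
                FROM members_flat \
                WHERE updated_at > (SELECT last_build FROM sph_counter WHERE counter_id = 1)

    # Kill-list: document ids to suppress from the main index at search time,
    # e.g. members removed as scammers since the main build.
    sql_query_killlist = SELECT id FROM members_removed
}

index members_delta : members
{
    source = members_delta_src
    path   = /var/data/sphinx/members_delta
}
```

Rebuilding the delta every 5 minutes from cron with `indexer --rotate members_delta` keeps results fresh, and a search across both is just a matter of passing both index names, e.g. `Query("term", "members members_delta")`.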

For more information on Sphinx check out the site, documentation and wiki for API examples.


About the Author

Ian Winter

Ian Winter is Head of Technical Operations for Venntro. He manages a team of four engineers who provide 24/7/365 support and are responsible for over 120 physical, virtual, storage and network devices.