Sphinx (part 2)

By: Ian Winter

Tags:

  • data-storage
  • search
  • sphinx

(This article is the second part in a series. Read Part 1)

Having used Sphinx for a while, we found there was still room for improvement. Our new search engine worked, but it wasn't as quick as we would have liked under load. Initially the index was built along a one-index-for-all approach, containing every field we might conceivably want to search on. Its primary use was serving searches for the main application, but utility applications that also needed search, such as our newsletter tool, relied on it too.

The change we made to improve performance was fairly simple. We analysed which attributes we were actually querying against and removed those that were unused. That alone would probably have been enough, but it still left quite a wide index in terms of the number of columns per query, which led us to the next step: splitting the index up. We tried using type-based indexes, for example:

source main_search
{
  ...
  sql_attr_uint = id
  sql_attr_uint = age
  sql_attr_uint = gender
  ...
}

source newsletter_search
{
  ...
  sql_attr_uint = id
  sql_attr_uint = age
  sql_attr_bool = emailpref
  ...
}

index main_search
{
  source = main_search
  path = /path/to/main_search
}

index newsletter_search
{
  source = newsletter_search
  path = /path/to/newsletter_search
}
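With the indexes split by purpose, each application queries only the index it needs. As a minimal sketch (index, attribute, and function names here are illustrative, not from our codebase; in practice the statement is sent to searchd over its SphinxQL listener using any MySQL-protocol client):

```python
# Sketch: build a SphinxQL statement targeting one purpose-specific index.

def quote_match(text):
    # Minimal escaping for the full-text match string.
    return "'%s'" % text.replace("\\", "\\\\").replace("'", "\\'")

def sphinxql(index, match, filters=None, limit=20):
    """Build a SphinxQL SELECT against a single named index."""
    where = ["MATCH(%s)" % quote_match(match)]
    for attr, value in (filters or {}).items():
        where.append("%s = %d" % (attr, value))
    return "SELECT * FROM %s WHERE %s LIMIT %d" % (
        index, " AND ".join(where), limit)

# The newsletter tool only ever touches its own, smaller index:
print(sphinxql("newsletter_search", "hiking", {"emailpref": 1}))
# SELECT * FROM newsletter_search WHERE MATCH('hiking') AND emailpref = 1 LIMIT 20
```

Because each query now names exactly one index, searchd only has to scan the attributes that index actually carries.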

This new approach had a number of significant benefits:

  • The individual index size was reduced as unnecessary data was no longer stored
  • The index rebuild time was reduced
  • We gained the ability to take specific indexes offline for maintenance
  • The end-to-end search time itself was reduced
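The audit that drove the first step — comparing the attributes declared in the config against those actually used in queries — can be sketched roughly as follows (the query log and the extra attribute names are invented for illustration):

```python
# Sketch: find declared attributes that no logged query ever filters on.
# The declared set loosely mirrors the config above; the log is invented.

declared = {"age", "gender", "emailpref", "height", "eyecolour"}

query_log = [
    "SELECT * FROM main_search WHERE MATCH('x') AND age = 30",
    "SELECT * FROM main_search WHERE MATCH('y') AND gender = 1",
    "SELECT * FROM newsletter_search WHERE MATCH('z') AND emailpref = 1",
]

# Crude substring scan; a real audit would parse the WHERE clauses.
used = {attr for attr in declared
        for query in query_log
        if attr in query}

unused = declared - used
print(sorted(unused))  # candidates for removal from the index config
```

Anything that survives this kind of audit as unused is a candidate for deletion, shrinking both the on-disk index and the rebuild time.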

Next time: using distributed indexes with Sphinx.


About the Author

Ian Winter

Ian Winter is Head of Technical Operations for Venntro. He manages a team of four engineers who provide 24/7/365 support and are responsible for over 120 physical, virtual, storage and network devices.