Tag1 Consulting

Performance and Scalability Experts

Comparing Xapian and Drupal 5's Core Search

Comments

Some thoughts

Submitted by Olly Betts (not verified) on Fri, 07/11/2008 - 01:26.

Very interesting - it certainly shows the importance of benchmarking the cases you actually care about!

I've not looked at the details of how Drupal uses Xapian or how its core search works, but my first guess as to why Xapian returns more results would be that perhaps stemming is used there but not with the core search. Second guess would be different tokenisation - i.e. different definitions of what a word is.

If you want to tune Xapian indexing performance, setting XAPIAN_FLUSH_THRESHOLD in the environment to a larger value will speed things up if you've enough RAM. This tells Xapian how many documents to index before automatically flushing changes. The default is currently 10000, which is very conservative; on beefy machines, 1000000 (a million) or more is plausible, though it depends on the number of terms per document.
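As a rough sketch of the above, assuming the indexer is launched as a child process: the variable has to be in the environment before Xapian opens the database for writing. The helper name `indexing_env` and the `indexer.py` script are hypothetical; only `XAPIAN_FLUSH_THRESHOLD` comes from Xapian itself.

```python
import os


def indexing_env(threshold, base_env=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set.

    `threshold` is the number of documents to index between automatic flushes.
    """
    if threshold <= 0:
        raise ValueError("threshold must be a positive document count")
    # Copy so the caller's (or the process's) environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)
    return env


env = indexing_env(1000000, base_env={"PATH": "/usr/bin"})
# Hypothetical usage:
#   subprocess.run(["python", "indexer.py"], env=env, check=True)
```

Larger values trade memory for fewer flushes, so the right number depends on available RAM and on how many terms each document carries.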

There aren't really tuning knobs for search performance, but if you compact the database first (using xapian-compact) that should speed things up a bit.
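A minimal sketch of scripting that step, assuming xapian-compact is on the PATH and is called with a source database followed by a new destination directory. The function names and any paths passed to them are illustrative, not part of Xapian.

```python
import subprocess


def compact_command(source, destination):
    """Build the argument list for xapian-compact (source first, destination last)."""
    if source == destination:
        raise ValueError("compact into a new directory, not over the source")
    return ["xapian-compact", source, destination]


def compact(source, destination):
    """Run xapian-compact and raise CalledProcessError if it fails."""
    subprocess.run(compact_command(source, destination), check=True)
```

Searches would then be pointed at the compacted copy rather than the original database.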

And if you're looking for even more things to profile, it would be interesting to see how Xapian's "under development" chert backend compares. This knocks about 44% off the size of the postlist table (the one most heavily used during searches) for gmane compared to the flint backend in 1.0.6. If you want to try it, snapshot tarballs are at http://oligarchy.co.uk/xapian/trunk/ and you need to set the environment variable XAPIAN_PREFER_CHERT to a non-empty value to get it used by default.

thanks for the feedback!

Submitted by admin on Fri, 07/11/2008 - 15:28.

"my first guess as to why Xapian returns more results would be that perhaps stemming is used there but not with the core search."

Yes, I believe you are correct here. The next chance I get to run more tests, I intend to confirm this theory.

"If you want to tune Xapian indexing performance, setting XAPIAN_FLUSH_THRESHOLD in the environment to a larger value will speed things up if you've enough RAM"

I've not spent much time performance tuning Xapian yet; however, it does not seem to me that this variable will affect PHP when using a local database, as I was. My understanding is that the PHP bindings tell Xapian to flush from RAM each time we unset the database variable. You can control how many documents the Xapian module indexes before flushing on the Xapian settings page. (There was an old bug in the module where it flushed after each and every document, and indexing performance was obviously horrid! This has long since been fixed.)
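To illustrate why that old bug hurt so much, here is a small sketch that counts flushes for a given batch size. It does not use Xapian or the Drupal module at all; `add_document` and `flush` are stand-in callables, and `index_in_batches` is a hypothetical name.

```python
def index_in_batches(documents, add_document, flush, batch_size):
    """Add every document, flushing once per `batch_size` documents.

    Returns the number of flushes performed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    flushes = 0
    pending = 0
    for doc in documents:
        add_document(doc)
        pending += 1
        if pending == batch_size:
            flush()
            flushes += 1
            pending = 0
    if pending:
        # Flush the final, partially filled batch.
        flush()
        flushes += 1
    return flushes


stored = []
per_document = index_in_batches(range(1000), stored.append, lambda: None, 1)
batched = index_in_batches(range(1000), stored.append, lambda: None, 250)
```

With a batch size of 1, every document pays the full cost of a flush; a larger batch spreads that cost over many documents.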

Perhaps this environment variable will affect things when using a remote search database with xapian-tcpserv?

"if you compact the database first (using xapian-compact) that should speed things up a bit."

I'll run some benchmarks and see how much effect compacting the tables has on performance. Thanks for the tip!

"if you're looking for even more things to profile, it would be interesting to see how Xapian's "under development" chert backend compares."

That sounds very interesting. Yes, I'll plan to test that out too. Again, thanks for the helpful suggestions!

stemming

Submitted by admin on Wed, 07/16/2008 - 15:25.

Actually, Drupal.org is using the Porter Stemmer module for core search.

stemming

Submitted by Olly Betts (not verified) on Thu, 07/17/2008 - 23:05.

Xapian uses the English (or "Porter2") stemmer from Snowball - perhaps Drupal's "Porter stemmer" is the algorithm as described in Martin Porter's original paper (which Snowball calls Porter).

I think that "skies" is an example treated differently - the original Porter stemmer producing "ski", Porter2 producing "sky".

stemming

Submitted by Olly Betts (not verified) on Thu, 07/17/2008 - 23:07.

Actually, I should probably clarify that - Xapian has both available, but if you ask for "en" or "english" then you get Porter2 since it generally does a better job...
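For the "skies" example mentioned earlier in this thread, a deliberately partial sketch of where the two algorithms diverge. It covers only step 1a of the original Porter algorithm and two entries from Porter2's list of special-cased words; it is not a full stemmer and does not call Xapian or Snowball. The function and constant names are made up for illustration.

```python
# Two of the words Porter2 handles as special cases before its normal rules.
PORTER2_SPECIAL_CASES = {"skis": "ski", "skies": "sky"}


def original_porter_step1a(word):
    """Step 1a of the original Porter algorithm: plural suffix handling only."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni, skies -> ski
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


def porter2_special_case(word):
    """Return Porter2's special-cased stem for `word`, or None if there is none."""
    return PORTER2_SPECIAL_CASES.get(word)


porter_stem = original_porter_step1a("skies")
porter2_stem = porter2_special_case("skies")
```

Documents indexed with one stemmer and queries stemmed with the other would not match on such words, which is one way result counts could differ between two search systems.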