SearchBench has received a couple of useful updates since yesterday's initial cloud tests. It can now generate search queries based on actual content, and it can export search benchmark results. With these features, SearchBench can be used to perform some actual performance comparisons.
Once again I set up these tests on an extra large EC2 instance. I still have not performed any tuning, and I continue to compare Drupal 5 core search against Xapian search. My initial benchmarks show that Xapian is roughly six times faster than Drupal's core search when a given search query actually returns results (an average of about 0.24 seconds per query versus about 1.45 seconds). In addition, Xapian is able to index a large site in about a third of the time taken by Drupal 5's built-in search. Read on for actual benchmark results and graphs.
These tests make it clear that it's important to use legitimate search terms when benchmarking search performance. SearchBench's new ability to extract wordlists from a site's actual content allows the tool to provide much more useful data. Again, note that neither Xapian nor MySQL has been tuned for these results, and that future benchmarks will aim to better understand how various tunings and configurations affect search performance.
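As an illustration of the idea, here is a minimal Python sketch of building a wordlist from site content and turning it into repeatable queries. It is not SearchBench's actual implementation: the function names `extract_wordlist` and `generate_queries`, the minimum word length, and the sample documents are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def extract_wordlist(documents, min_length=4, max_words=1000):
    """Return the most frequent words found in the given texts."""
    counts = Counter()
    for text in documents:
        # Lowercase and keep runs of letters only; a real tokenizer may differ.
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length:
                counts[word] += 1
    return [word for word, _ in counts.most_common(max_words)]


def generate_queries(wordlist, count=100, max_terms=2, seed=0):
    """Build `count` queries of 1..max_terms words drawn from the wordlist."""
    if not wordlist:
        raise ValueError("wordlist is empty")
    # A fixed seed means every search backend can be fed identical queries.
    rng = random.Random(seed)
    queries = []
    for _ in range(count):
        n_terms = rng.randint(1, min(max_terms, len(wordlist)))
        queries.append(" ".join(rng.sample(wordlist, n_terms)))
    return queries


# Hypothetical sample content standing in for real site nodes.
documents = [
    "Xapian indexes a large site quickly",
    "Drupal core search creates temporary tables for each query",
]
wordlist = extract_wordlist(documents)
queries = generate_queries(wordlist, count=5)
for query in queries:
    print(query)
```

Because every query term is taken from the content itself, each query has a realistic chance of matching something, which is what makes the timings meaningful.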

Most of these queries did not return any actual search results. The few slowdowns you see occur where Xapian did return results for a query.

These are the same queries that were used in the previous test. Note that Drupal core's search did not return results at any time. It would be interesting to compare the queries where Xapian does return results but Drupal core does not, and to fully understand why the search results differ.

In this test, SearchBench generated wordlists based on words extracted from actual content on the website being tested. As a result, many of the queries returned actual results, visible in the performance slowdown above.
Some hard numbers from the above test:
| Total tests | 3 |
| Searches per test | 100 |
| Total time | 71.5365 seconds |
| Average time per test | 23.8455 seconds |
| Average time per query | 0.23845 seconds |
| Longest query | 0.66174 seconds |
| Shortest query | 0.12636 seconds |

Thanks to SearchBench, the queries used in this test are identical to the queries used in the previous Xapian test, offering a more precise comparison between the two search solutions. There is an apparent slowdown in Drupal core-powered searches when they return actual results. Much of this slowdown is likely due to the creation of temporary tables, an issue that has been significantly improved in Drupal 6. This functionality is being backported to Drupal 5 as an optional patch, on which I plan to run additional benchmarks.
Some hard numbers from the above test:
| Total tests | 3 |
| Searches per test | 100 |
| Total time | 433.8613 seconds |
| Average time per test | 144.6204 seconds |
| Average time per query | 1.44620 seconds |
| Longest query | 4.90253 seconds |
| Shortest query | 0.11557 seconds |
The raw search data from the above benchmarks can be found in this Gnumeric spreadsheet.
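For readers who want to check the arithmetic, the short Python sketch below recomputes the averages from the totals reported in the two tables and derives the Xapian-to-core speedup. The helper name `summarize` is hypothetical; the input figures are taken from the tables above.

```python
def summarize(total_seconds, tests, searches_per_test):
    """Return (average time per test, average time per query) in seconds."""
    per_test = total_seconds / tests
    per_query = total_seconds / (tests * searches_per_test)
    return per_test, per_query


# Totals as reported: 3 tests of 100 searches each for both backends.
xapian_test, xapian_query = summarize(71.5365, tests=3, searches_per_test=100)
core_test, core_query = summarize(433.8613, tests=3, searches_per_test=100)

# Ratio of the average per-query times (Drupal core / Xapian).
speedup = core_query / xapian_query

print(f"Xapian:      {xapian_test:.4f} s/test, {xapian_query:.5f} s/query")
print(f"Drupal core: {core_test:.4f} s/test, {core_query:.5f} s/query")
print(f"Speedup:     {speedup:.2f}x")  # about 6.06x
```

This compares averages only; the longest and shortest query times in the tables show that individual queries vary considerably around those averages.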
There are many more benchmarks planned, as detailed in my earlier blog posting. SearchBench is being developed as a tool to better understand search performance and scalability. Tag1 Consulting is focused on defining solid recommendations and best practices for obtaining optimal performance from LAMP-powered search solutions, and on continuing to improve Drupal's scalability.


Comments
Some thoughts
Very interesting - it certainly shows the importance of benchmarking the cases you actually care about!
I've not looked at the details of how Drupal uses Xapian or how its core search works, but my first guess as to why Xapian returns more results would be that perhaps stemming is used there but not with the core search. Second guess would be different tokenisation - i.e. different definitions of what a word is.
If you want to tune Xapian indexing performance, setting XAPIAN_FLUSH_THRESHOLD in the environment to a larger value will speed things up if you've enough RAM - this tells Xapian how many documents to index before automatically flushing changes (default is 10000 currently, which is very conservative - on beefy machines, 1000000 (a million) or more is plausible, though it depends on the number of terms per document).
There aren't really tuning knobs for search performance, but if you compact the database first (using xapian-compact) that should speed things up a bit.
And if you're looking for even more things to profile, it would be interesting to see how Xapian's "under development" chert backend compares. This knocks about 44% off the size of the postlist table (the one most heavily used during searches) for gmane compared to the flint backend in 1.0.6. If you want to try it, snapshot tarballs are at http://oligarchy.co.uk/xapian/trunk/ and you need to set the environment variable XAPIAN_PREFER_CHERT to a non-empty value to get it used by default.
thanks for the feedback!
Yes, I believe you are correct here. The next chance I get to run more tests, I intend to confirm this theory.
I've not spent much time performance tuning Xapian yet; however, it does not seem to me that this variable will affect PHP when using a local database, as I was. My understanding is that the PHP bindings will tell Xapian to flush from RAM each time we unset the database variable. You can control how many documents you index before flushing with the Xapian module on the Xapian settings page. (There was an old bug with the module where it was flushing for each and every document, and indexing performance was obviously horrid! This has long since been fixed.)
Perhaps this environment variable will affect things when using a remote search database with xapian-tcpserv?
I'll run some benchmarks and see how much effect compacting the tables has on performance. Thanks for the tip!
That sounds very interesting. Yes, I'll plan to test that out too. Again, thanks for the helpful suggestions!
stemming
Actually, Drupal.org is using the Porter Stemmer module for core search.
stemming
Xapian uses the English (or "Porter2") stemmer from Snowball - perhaps Drupal's "Porter stemmer" is the algorithm as described in Martin Porter's original paper (which Snowball calls Porter).
I think that "skies" is an example treated differently - the original Porter stemmer producing "ski", Porter2 producing "sky".
stemming
Actually, I should probably clarify that - Xapian has both available, but if you ask for "en" or "english" then you get Porter2 since it generally does a better job...