In previous search benchmarks, I utilized random content generated with Drupal's devel module. In these latest benchmarks, I used an actual sanitized copy of the Drupal.org community website database, with email addresses and passwords removed. The first tests were intended to confirm that Xapian continues to perform well with large amounts of actual data. Additional tests were performed to measure the effect of various MySQL tunings and configurations. The following data was derived from several hundred benchmarks run on an Amazon AWS instance over the past week using the SearchBench module.
These tests confirm that Xapian continues to offer better search performance than Drupal's core search module. Contrary to popular belief, the data also shows that using the InnoDB storage engine for search tables significantly outperforms using the MyISAM storage engine for search tables, especially when your database server has sufficient RAM. The data also confirms that allocating additional RAM for MySQL's temporary tables can also improve search performance.
When reviewing the top searches on Drupal.org, we found that they were generally only 1 or 2 words long, and that they rarely included phrases or term negation. Thus, my first tests compared Drupal 5's core search with Xapian search using similarly simple queries. Each benchmark was repeated three times, and the three tests were validated for consistency. I occasionally returned to a previous test and ran it again to verify that the generated results continued to be consistent each time.
The version of the Drupal.org database that I used was from early March of 2008, and contained 742,885 comments, 223,272 nodes, and 252,189 users. I utilized the core search indexes that were already in the database, but I still had to generate new indexes for Xapian.
- Xapian consistently outperformed Drupal 5's core search using the Drupal.org database.
- Xapian did not increase load on the server as compared to using Drupal 5's core search, though it did increase Apache's memory usage.
- Drupal 5's core search performance benefited from increasing max_heap_table_size to 32M (a 13% speedup). Further increasing max_heap_table_size and tmp_table_size to 64M did not offer another significant performance improvement.
- We saw as much as a 29% performance improvement by converting the search tables from using the MyISAM storage engine to the InnoDB storage engine. InnoDB performance suffered when MySQL ran out of RAM, but it still significantly outperformed MyISAM. This is counter to the commonly accepted belief that Drupal's search module does not perform well when using InnoDB, and needs additional testing.
- Compacting Xapian's database did not offer any perceivable performance improvement.
Drupal 5 Core Search vs. Xapian Search
It took an average of 3.68107 seconds for Drupal 5's core search to run each query in the above benchmark. SearchBench randomly generated a list of 100 search queries derived from content within the database, and then used this list to benchmark the same search queries multiple times. The list of queries was executed three times in a row in this configuration, and each time the group of 100 queries completed in nearly the same exact amount of time, showing less than 1 second of total variation. (The first test averaged 3.676 seconds per query, the second 3.681 seconds per query, and the third 3.686 seconds per query.) This consistency helps to confirm the SearchBench is generating valid data for useful comparisons.
Average time per test: 368.1071 seconds
Average time per query: 3.68107 seconds
Longest query: 25.63190 seconds
Shortest query: 0.16111 seconds
The same 100 search queries used to test Drupal 5's core search were used again to test Xapian. It took an average of 2.23243 seconds for Xapian search to run each query in the above benchmark, a 65% speed up compared to Drupal 5's core search. The same list of queries was again run three times in a row to confirm the reliability of these numbers. The first test averaged 2.177 seconds per query, the second 2.263 seconds per query, and the third 2.257 seconds per query. The second and third tests completed within one second of each other, while the first pass completed nearly ten seconds quicker.
Average time per test: 223.2431 seconds
Average time per query: 2.23243 seconds
Longest query: 11.61143 seconds
Shortest query: 0.16880 seconds
This test was intended to understand the effect of including phrases and negation in some of the search queries. A phrase is when multiple words are joined together with quotes, such as "this is a phrase". Negation is when you search for documents that don't include a specific word by preceding it with a "-", such as "-word". The overall benchmark was completed faster than when not using phrasing and negation, however this is because many of the queries didn't return any results. More telling is that the slowest query jumped up from 25 seconds to 41 seconds! The randomly generated slow search query was quite ugly:
"software site/all/modules chould node/add/xxx/parent/xxx illegal" alter (not "complete?2)"
Average time per test: 325.2230 seconds
Average time per query: 3.25223 seconds
Longest query: 41.26123 seconds
Shortest query: 0.16368 seconds
The same queries used in the above test were repeated with Xapian search, where again there was a perceived speedup caused by many of the queries returning no results.
Note that while this benchmark and the previous Core search benchmark used the exact same list of queries, they do not use the same queries as were used in the earlier benchmarks. Thus, the fact that the average query speed is less could be simply caused by the fact that different queries were used in this test. To obtain a better understanding of how phrasing and negation affects overall performance, similar tests should be run with far more than 100 queries to obtain a more accurate average. In any case, in this test Xapian's slowest query took 12.5 seconds. The query that took Drupal 5's core search 41 seconds took Xapian only 12.4 seconds, a 300% speedup, while neither search actually returned any results.
Average time per test: 204.3366 seconds
Average time per query: 2.04337 seconds
Longest query: 12.52517 seconds
Shortest query: 0.17208 seconds
Beyond raw performance numbers, these benchmarks were also focused on understanding how Drupal 5's core search module and Xapian search each utilized server resources. To collect the following data, I used SearchBench to create a list of tens of thousands of unique search queries, and collected data with Sar while performing back to back searches with each solution for well over an hour.
Drupal 5 Core Search Server Performance:
Visible in these tests, for the first half an hour there was a noticeable amount of disk activity showing up in Sar output as IO wait. Watching the test, I assumed that this disk activity was caused by Drupal 5's known tendency to create temporary tables while finding search results. However, I was perplexed to discover that this abnormally high disk activity stopped after the first thirty minutes. It's possible that after 30 minutes the SearchBench queries were similar enough for MySQL to start caching data in such a way that Disk IO was minimized. This is possible, as all search queries were built from a relatively small list of only 5,000 words extracted from site content. It would be interesting to repeat this test using a much larger list of unique words.
03:00:01 PM CPU %user %nice %system %iowait %steal %idle 03:00:01 PM all 14.63 0.00 2.12 24.85 0.33 58.07 03:10:01 PM all 22.81 0.00 3.93 23.41 0.78 49.07 03:20:01 PM all 30.54 0.00 5.00 16.94 1.60 45.92 03:30:01 PM all 39.30 0.00 5.98 1.68 1.60 51.43 03:40:01 PM all 40.52 0.00 5.55 1.43 1.53 50.97 03:50:01 PM all 39.59 0.00 6.34 1.43 1.69 50.95
Context switches per second
03:00:01 PM cswch/s 03:00:01 PM 709.88 03:10:01 PM 808.38 03:20:01 PM 995.64 03:30:01 PM 866.86 03:40:01 PM 736.75 03:50:01 PM 716.50
Paging and faults
03:00:01 PM pgpgin/s pgpgout/s fault/s majflt/s 03:00:01 PM 2049.73 264.24 749.68 0.23 03:10:01 PM 928.91 3176.36 423.70 0.02 03:20:01 PM 246.67 5967.00 252.97 0.03 03:30:01 PM 86.23 1396.19 270.22 0.00 03:40:01 PM 56.27 1224.52 198.56 0.00 03:50:01 PM 66.05 931.83 229.10 0.00
03:00:01 PM kbmemfree kbmemused %memused kbbuffers kbcached 03:00:01 PM 373432 7491028 95.25 62612 2048496 03:10:01 PM 15392 7849068 99.80 63216 2553356 03:20:01 PM 71968 7792492 99.08 67424 2639488 03:30:01 PM 71788 7792672 99.09 68844 2608772 03:40:01 PM 50140 7814320 99.36 71272 2485512 03:50:01 PM 72624 7791836 99.08 71576 2315728
Xapian Search Server Performance:
As you can see from the timestamps, the Xapian benchmarks were run before the MySQL benchmarks. We can see that Xapian also used caused fairly high disk IO, though not as much as Drupal 5's core search module. It is also interesting to note that Xapian search made much more consistent use of server resources, with less variation between the different time snapshots.
01:30:01 PM CPU %user %nice %system %iowait %steal %idle 01:30:01 PM all 31.55 0.00 6.11 8.42 2.67 51.25 01:40:01 PM all 32.58 0.00 6.69 6.93 2.79 51.01 01:50:02 PM all 34.82 0.00 6.85 4.03 3.11 51.20 02:00:01 PM all 35.16 0.00 6.71 4.06 2.92 51.14 02:10:01 PM all 34.97 0.00 6.89 4.20 2.98 50.96 02:20:01 PM all 35.74 0.00 6.74 2.98 3.18 51.35
Context switches per second
01:30:01 PM cswch/s 01:30:01 PM 878.01 01:40:01 PM 895.38 01:50:02 PM 940.72 02:00:01 PM 875.44 02:10:01 PM 936.37 02:20:01 PM 942.42
Paging and faults
01:30:01 PM pgpgin/s pgpgout/s fault/s majflt/s 01:30:01 PM 283.33 4726.11 114.07 0.00 01:40:01 PM 320.27 4067.61 104.76 0.00 01:50:02 PM 136.67 4598.89 144.10 0.00 02:00:01 PM 126.53 4066.34 131.40 0.00 02:10:01 PM 117.59 5798.11 119.74 0.01 02:20:01 PM 83.78 4413.75 128.38 0.00
Reviewing the following table you can see that Xapian placed additional memory requirements on the server. This is logically explained, as Xapian is running within PHP, adding additional memory overhead to the Apache process, while Drupal's core search makes use of MySQL's existing memory allocations. (Later in this document, we find that giving MySQL more memory can also improve Drupal 5's core search performance.)
01:30:01 PM kbmemfree kbmemused %memused kbbuffers kbcached 01:30:01 PM 41552 7822908 99.47 58172 1593224 01:40:01 PM 72748 7791712 99.07 40460 1439720 01:50:02 PM 74536 7789924 99.05 43808 892112 02:00:01 PM 29564 7834896 99.62 47884 898920 02:10:01 PM 73792 7790668 99.06 48420 867100 02:20:01 PM 71728 7792732 99.09 51416 870336
Drupal 5 Core Search, MySQL Temporary Table Size
A significant performance improvement found in Drupal 6's core search module is a change to table structure causing search to generate fewer temporary tables. With the knowledge that temporary tables are a known performance bottleneck in Drupal 5, I ran some benchmarks to measure the affect of allocating additional RAM to MySQL for creating temporary tables.
Average time per test: 413.3598 seconds
Average time per query: 4.13360 seconds
Longest query: 37.72950 seconds
Shortest query: 0.16072 seconds
Average time per test: 365.7413 seconds
Average time per query: 3.65741 seconds
Longest query: 27.67167 seconds
Shortest query: 0.15986 seconds
Average time per test: 360.4615 seconds
Average time per query: 3.60461 seconds
Longest query: 27.73822 seconds
Shortest query: 0.16011 seconds
Default MySQL Tuning
The first benchmark was with MySQL running with default settings for temporary tables, so tmp_table_size was 32M and max_heap_table_size was 16M. It should be noted that MySQL essentially uses the lower of these twp values, so in this configuration MySQL is allowed to use a maximum of 16M per process for temporary table allocations before it switches to using file-based temporary tables.
MySQL with 32M temporary tables
For this second benchmark, max_heap_table_size was increased to 32M, matching the default setting for tmp_table_size. This configuration change brought us a 13% performance improvement, again comparing the same exact list of search queries. (MySQL was restarted to flush all caches, and I repeated each of these benchmarks a second time to verify this speedup). The improved performance makes sense, as MySQL is now able to create larger temporary tables in memory, without suffering a slowdown by writing the temporary tables to disk. However, be aware that this configuration change results in an per-connection increase of 16M. In other words, with the default MySQL setting of 100 max_connections, we are potentially allocating another 1.6G to the MySQL process. Keep this in mind so that you don't accidentally send your MySQL process into swap, which would result in terrible performance.
MySQL with 64M temporary tables
For the third benchmark, I increased both tmp_table_size and max_heap_table_size to 64M. This gained us another 1.5% performance boost, or a 14.7% improvement compared to MySQL's default tuning. At this point we've already reached diminishing returns. Thus, with the drupal.org data set, I would leave these two settings at 32M. To be clear, this second configuration change increased the potential per-process memory usage by another 32M, or a potential total of 3.2G with the default of 100 max_connections! Due to the minimal gains this caused, it is very likely that there are better uses for this memory.
Drupal 5 Core Search, MyISAM versus InnoDB
It has been commonly accepted by Drupal performance experts that Drupal's core search module does not perform well if the search tables are using the InnoDB storage engine. These benchmarks were intended to both verify and quantify this claim, but instead they disproved the claim. Instead, it was found that you can achieve a significant performance gain by converting your search tables to InnoDB, if your server has sufficient RAM. However, search tables can grow to a very large size, it may often be the case that you do not have sufficient RAM.
Average time per test: 413.3598 seconds
Average time per query: 4.13360 seconds
Longest query: 37.72950 seconds
Shortest query: 0.16072 seconds
Average time per test: 375.5651 seconds
Average time per query: 3.75565 seconds
Longest query: 28.37350 seconds
Shortest query: 0.16093 seconds
Average time per test: 321.9647 seconds
Average time per query: 3.21965 seconds
Longest query: 25.69365 seconds
Shortest query: 0.16256 seconds
Average time per test: 357.4054 seconds
Average time per query: 3.57405 seconds
Longest query: 51.53070 seconds
Shortest query: 0.15946 seconds
For this first benchmark, MySQL was restored to its default settings for temporary tables, and all three search tables were confirmed to be using the MyISAM storage engine. This set the baseline for comparing MyISAM performance to InnoDB performance.
Next, I converted all 221,655 rows of the search_dataset table to the InnoDB storage engine, leaving the other two search tables using the MyISAM storage engine. I ran the same identical benchmarks again, and average performance improved by 10%.
Next I converted all 452,370 rows of the search_total table to InnoDB, and ran the same benchmarks again. Converting this second search table brought us an even more impressive 17% performance improvement, for a total of a 29% improvement over purely MyISAM tables.
All Search Tables InnoDB
Finally I converted all 20,017,829 rows of the search_index table to the InnoDB storage engine. This conversion did not result in a performance improvement as the other conversions had, actually resulting in an 11% performance penalty as compared to the previous test. This was still a 15.7% improvement over running with MyISAM search tables.
While writing up this report, I reviewed the data closer and discovered that this slowdown from converting the search_index table was not due to a limitation of InnoDB, but was in fact due to InnoDB running out of memory. This is not surprising when one considers the very large size of the search_index table, and InnoDB's greater memory requirements than MyISAM. Furthermore, in many cases search query speed actually improved, and it was only some specific queries where performance suffered due to InnoDB running out of memory.
Reviewing this data, I'm left to assume that the reason it has wrongly been believed that InnoDB tables do not offer good search performance is because of InnoDB running out of memory. In reality, if your server has sufficient memory you will likely benefit from using purely InnoDB tables. Even in our benchmarks where we ran out of memory, search tables using the InnoDB storage engine managed to outperform search tables using the MyISAM storage engine.
I plan to run additional benchmarks with a new dataset to confirm these findings at a future time.
My final benchmark looked at server performance from running a sustained search load on a database using the InnoDB storage engine for search tables. I used the exsact same query list described earlier in this document, and collected data with Sar.
Drupal 5 Core Search With InnoDB Server Performance:
08:20:01 PM CPU %user %nice %system %iowait %steal %idle 08:20:01 PM all 42.61 0.00 1.76 3.00 1.77 50.86 08:30:02 PM all 42.69 0.00 1.84 2.60 1.85 51.03 08:40:01 PM all 41.91 0.00 1.77 3.00 1.96 51.36 08:50:01 PM all 41.82 0.00 1.82 3.19 1.96 51.22 09:00:01 PM all 41.92 0.00 1.92 3.11 1.94 51.11 09:10:01 PM all 41.48 0.00 1.76 3.43 1.86 51.46 09:20:01 PM all 41.49 0.00 1.98 3.63 1.63 51.26 09:30:01 PM all 40.92 0.00 1.76 3.87 1.79 51.66
Reviewing the above data, it's clear that using InnoDB for search tables does not result in an increase in CPU utilization, as these numbers are very much in line with what we saw earlier when using the MyISAM storage engine. Once difference is that we did not experience a Disk IO storm as we did during our MyISAM tests. This could be an anomaly, or it could be the result of InnoDB making better use of RAM, having already loaded much of the search tables into memory.
Compressing Xapian's Database
A comment posted to an earlier article suggested that we try compressing Xapian's database files to improve search performance.
Average time per test: 216.1539 seconds
Average time per query: 2.16154 seconds
Longest query: 5.51119 seconds
Shortest query: 0.16965 seconds
Compacted Xapian Search Performance
Next, I used the xapian-compact command line utility to compact the Xapian databases. This resulted in the following output:
postlist: Reduced by 56.8097% 350552K (617064K -> 266512K) record: Reduced by 1.05708% 40K (3784K -> 3744K) termlist: Reduced by 5.88804% 13152K (223368K -> 210216K) position: Reduced by 0.0104678% 64K (611400K -> 611336K) value: Size unchanged (0K) spelling: Size unchanged (0K) synonym: Size unchanged (0K)
Uncompacted Xapian Search Performance
First I set a baseline by running some benchmarks on an uncompressed Xapian database.
The reduced size was also visible at the filesystem level:
du -shr xap* 1.4G xapian.orig 1.1G xapian.compact
Strangely enough, this did not offer any improvement in search performance, instead showing up as a slight performance penalty. Reviewing the data closer, I realized that the first of our three benchmarks suffered a penalty as this was the first time Xapian had used the specified database files. The second and third benchmarks had very similar results compared to the uncompressed files, as summarized below. In conclusion, with the Drupal.org dataset we did not measure any performance improvement by compacting the Xapian database files.
It seems likely that the reason compacting the database did not improve performance is because we were using a static dataset. My theory is that if content was being updated, deleted, and added as it would be on a live website, and the Xapian database was being updated each time, then compacting the database may actually have a beneficial effect.
Average time per test: 218.5537 seconds
Average time per query: 2.1855 seconds
Longest query: 5.58208 seconds
Shortest query: 0.16979 seconds
Where do we go from here?
The most important thing I learned from running these benchmarks is to stop recommending that the MyISAM storage engine be used for Drupal's search tables. I will be running further benchmarks to validate these findings, and to explore configuration settings that can help InnoDB's search performance.
This data was also quite useful to confirm that Xapian can help scale search on a large Drupal website with lots of content. I would be interested in learning more about how Xapian search performance can be further optimized.
In future benchmarking I will be comparing Drupal 5's core search module with Drupal 6's core search module. I would like to validate the claim that Drupal 6 offers improved search performance, and to review MySQL configurations that can further optimize its performance, as research to my online performance and scalability book.
Finally, I soon plan to broaden my research to include other third party search solutions, including both Sphinx and Solr.