Tag1 Consulting

Performance and Scalability Experts

Tuning Search In Drupal 5

Comments

Compacted databases

Submitted by Olly Betts (not verified) on Mon, 07/21/2008 - 09:28.

Hmm, there seem to be some odd regular gaps in the barchart for the compacted timings - there don't seem to be corresponding almost-zero figures in the raw data, and the graphs in the gnumeric file don't seem to have them. Some sort of image rendering issue?

Anyway, your theory is probably roughly right. If you index data into a fresh database by simply appending documents, then a linear insert mode gets used a lot and the database will be pretty compact anyway.

This is less true for the postlist table as this holds the term -> document mappings which produce more scattered writes, and the output from xapian-compact shows a lot of space being reclaimed there. Partly that will be blocks which aren't currently used at all; the rest will be blocks which aren't completely full. The former probably matter less, especially if they are grouped, as we won't ever try to read them. The latter should make a difference, as packing them means we're more likely to have already read in an entry we want along with another in the same block.

I'm not sure why it seems slightly slower though. Perhaps just chance (which blocks happen to be near others so disk seeking times happen to be more). Or perhaps the "faster" code we use for a fully compact database isn't really faster and we'd be better off using the same code all the time! I'll have to check that.

Incidentally, "compact" is more correct than "compress" for describing what's going on - the operation essentially just minimises unused space in the .DB files rather than running any data through a compression algorithm.
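
To make the compact/compress distinction concrete, here is a toy sketch in Python. It is not Xapian's actual B-tree block format; the block capacity and the entries are made up for illustration. It only shows that repacking partially filled blocks reduces the number of blocks while every entry is stored unchanged.

```python
from typing import List

# Hypothetical number of entries per block, chosen only for illustration.
BLOCK_CAPACITY = 4


def compact(blocks: List[List[str]], capacity: int = BLOCK_CAPACITY) -> List[List[str]]:
    """Repack entries into as few blocks as possible, preserving their order.

    No entry is altered or compressed; only the unused space goes away.
    """
    entries = [entry for block in blocks for entry in block]
    return [entries[i:i + capacity] for i in range(0, len(entries), capacity)]


# A fragmented table: one unused block and several partially filled ones.
fragmented = [["a", "b"], [], ["c"], ["d", "e", "f"], ["g"]]
packed = compact(fragmented)

print(len(fragmented), "blocks before,", len(packed), "blocks after")
print(packed)
```

In this model a lookup that reads one block brings more neighbouring entries along with it after packing, which is the effect described above for partially full blocks.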

Xapian performance

Submitted by admin on Mon, 07/21/2008 - 22:14.

"Hmm, there seem to be some odd regular gaps in the barchart for the compacted timings"

Yeah, I seem to have hit upon some graph generation bug, but didn't have time to dig into it too deeply. Perhaps it would have been better to simply not display that graph on the web page at all as it could cause confusion.

"I'm not sure why it seems slightly slower though. Perhaps just chance (which blocks happen to be near others so disk seeking times happen to be more). Or perhaps the 'faster' code we use for a fully compact database isn't really faster and we'd be better off using the same code all the time! I'll have to check that."

I'd be fascinated to hear what you decide! Fortunately when running xapian-compact you end up with a complete copy of the database, so it's possible to test the before and after and to choose the best one.

"Incidentally, 'compact' is more correct than 'compress' for describing what's going on - the operation essentially just minimises unused space in the .DB files rather than running any data through a compression algorithm."

Thanks for clarifying. I suppose it should have been obvious to me, as you didn't call the utility "xapian-compress"... ;)

While we're talking about Xapian performance, if I wanted to share a read-only copy of xapian indexes across multiple servers to allow multiple servers to handle search queries, is it safe to run rsync to keep them all in sync? Or would this risk pushing a corrupt copy for example if rsync runs while the xapian database files are being written to?

Also, is there likely to be a performance boost from keeping these read-only copies of the database on a tmpfs mount in RAM?

Thanks again for all your feedback.

Replication

Submitted by Olly Betts (not verified) on Fri, 07/25/2008 - 23:32.

"if I wanted to share a read-only copy of xapian indexes across multiple servers to allow multiple servers to handle search queries, is it safe to run rsync to keep them all in sync? Or would this risk pushing a corrupt copy for example if rsync runs while the xapian database files are being written to?"

It's not safe if the database is written to during the sync operation. There's a new replication feature on trunk (but not in 1.0.x) which is aimed at this sort of situation. The documentation for it discusses other approaches, including using rsync:

http://trac.xapian.org/browser/trunk/xapian-core/docs/replication.rst
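
As a rough illustration of the "don't copy while it's being written" concern, here is a minimal Python sketch. It is not taken from the Xapian documentation and is not the replication feature. The function names and the retry count are hypothetical, and comparing file sizes and modification times is only a heuristic for detecting a concurrent write, not a guarantee of a consistent copy.

```python
import os
import shutil
import tempfile
from typing import Dict, Tuple


def manifest(db_dir: str) -> Dict[str, Tuple[int, int]]:
    """Map each regular file in db_dir to its (size, mtime in nanoseconds)."""
    result = {}
    for name in sorted(os.listdir(db_dir)):
        path = os.path.join(db_dir, name)
        if os.path.isfile(path):
            st = os.stat(path)
            result[name] = (st.st_size, st.st_mtime_ns)
    return result


def snapshot(db_dir: str, dest_dir: str, attempts: int = 3) -> bool:
    """Copy db_dir to dest_dir, publishing the copy only if the source
    looked unchanged for the whole duration of the copy.

    Returns True on success, False if every attempt saw a change.
    """
    parent = os.path.dirname(os.path.abspath(dest_dir))
    for _ in range(attempts):
        before = manifest(db_dir)
        # Copy into a staging directory next to the destination first.
        staging = tempfile.mkdtemp(prefix="snapshot-", dir=parent)
        shutil.copytree(db_dir, staging, dirs_exist_ok=True)
        if manifest(db_dir) == before:
            # Assumption: nothing is reading dest_dir during this swap.
            # Removing and renaming is not atomic.
            if os.path.exists(dest_dir):
                shutil.rmtree(dest_dir)
            os.rename(staging, dest_dir)
            return True
        # The source changed while we were copying: discard and retry.
        shutil.rmtree(staging)
    return False


# Example with hypothetical paths:
# ok = snapshot("/var/lib/search/db", "/var/lib/search/db-snapshot")
```

The snapshot directory, rather than the live database, would then be the thing to rsync to the other servers. Pausing the indexer for the duration of the copy is the simpler and safer option where that is possible.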


"Also, is there likely to be a performance boost from keeping these read-only copies of the database on a tmpfs mount in RAM?"

If there's sufficient RAM to do that, then even with the DB on disk the VM system will tend to end up caching the DB in RAM anyway, so probably the only benefit of using tmpfs would be the elimination of the cache "warm up" time just after copying. If there's VM pressure, the VM system may even decide to page out unaccessed blocks from the tmpfs to swap - in some ways the distinction between "in RAM" and "on disk" is a bit artificial, as the data will tend to end up where it's most useful either way.

thanks

Submitted by admin on Wed, 07/30/2008 - 12:20.

I'd not found the replication document before; that was very helpful. Thanks!