2007-09-07

Thinking more about compression

I'm using compression on the URIs and literals in the quad-store to trade CPU for disk/memory bandwidth. The current algorithm is LZW, with a 24bit index look up table. Yesterday's benchmarks indicate it's not doing massively good compression - ; on the other hand it's small and I can see how to preform regular expression searches on the compressed data. 24bit index tables take 64MiB and give around 1:5 compression - reduces to 23%. You can't reset the table like you do in a streaming compressor, since you want the mapping of compressed bits . Bigger tables take more room both - more entries, and the entries become larger as the number of bits to reference another entry increases. If you had 25bit indexes, you'd need 5 extra bytes per entry, so take 154MiB rather than 128MiB and possibly suffer due to cache line mis-alignment.

Since each value in the quad is reduced to a 32 bit identifier in the store, and at least one bit is needed to indicate whether it's fully described by the identifier (uncompressing the identifier's bit-pattern gives the URL), or needs a look-up to longer bit pattern. Obviously, the more URLs which can be squished into 32bit patterns the faster the store will be written to, as it removes the need to look elsewhere to match the string pattern of the value with one already in a hash table to check that the same identifier is only ever used for the same string. But 24+1 bits leaves 7 unused bits of space in half the possible identifiers, which reduces the maximum identifier density to (128+1)/256 = 50.3%, which isn't very good.

Looking at the benchmarks here, there seem to be much better methods than LZW, but most of them are neural net or model based, and I can't see how to map a sequence of hints for a neural net into anything that could have regex implemented on top of it without decompressing. They also are orders of magnitude heavier on the cpu, apparently enough to fade the total performance.

The other options are to not bother with the intrinsic identifiers - which means finding out how many of the URLs would fit in the LZW table for typical datasets (certain of the computer generated ones they would), finding other uses for the spare bits such as 11xx xxxx xxxx xxxx xxxx xxxx xyyy yyyy indicating a 23 bit LZW index and a 7 bit character. Maybe the simplest way would be to have aaaa aaaa xxxx xxxx xxxx xxxx xxxx xxxx where any non-zero in the a's indicates it's a hashkey, so you get full identifier use at the cost of more hashtable lookups.

Holding the full hash in memory isn't an option; even just a population bit would take half a gig. Since each node is single-threaded, it's possible to split URLs from literals and not care whether literals match outside of a smaller MRU table cache; that would increase search times for equality of literals with non-intrinsic ids, but not effect regex searches. It may also be worth maintaining separate LZW tables for literals and URIs. Eliminating duplications can be done in the background anyway.

I think the next experiment will be scanning the datasets I've got - swoogle's ten million triples, wikipedia in a couple of formats (dbpedia and the xml dump), and the uba owl generator - to see what hit rate and spread I'd get generating ids using LSW and hashes.


TME

Labels: , ,

0 Comments:

Post a Comment

<< Home