Tin Can Camera: 2007-04

Rich Application Platforms

Well, there's been a bit of a kerfuffle about whether or not Silverlight (nee WPF/E) supports binding, widgets and XMLHttpRequest gubbins. And Adobe has open sourced Flex, as well as having had its Apollo platform out for a bit now. XULRunner is still going, being used for SongBird and Joost, and freshly the DVD version of Wikipedia, though whether or not Mozilla is promoting it well as a platform is under discussion.

Having done an awful lot of XULRunner programming in the last few months, there's still a lot to be done before its an easy to use platform. Quite a lot of undocumented interactions cause crashes - particularly race conditions between XBL bindings being attached and scripts in the body of the document being run, and any manipulation of the tree in script during an XBL constructor. But I don't have any idea how well Silverlight or Apollo handle such things, or even if they support anything as flexible as XBL. There needs to be an XULRunner 2.0 that is a lot more predictable in its behaviour, and less arcane in its invocations.

There's also a big question for the sort of system's I'm interested in - can a platform do something like the ClusterBall WikiPedia visualisation? XULRunner SVG grinds to a halt at a thousand or so arcs, and rotated font anti-aliasing is a bit off. Flash, using Haxe, can render ten thousand arcs, but getting text to rotate is a hassle - it requires embedded fonts, and that means more tools, you can't just set the rotation property. There's quite a lot of weirdness in Flash, half of it is object view graph based, the rest procedural, not everything works and it doesn't play nicely with the browser, for example capturing keyboard input and not being part of selections. As far as I know, there isn't an open high-quality 2D graphics library other than anti-grain 2.4 which will scale and isn't, well, a bit weird. But anti-grain has moved to GPL, and isn't hardware accelerated at 2.4, so what will become of it I don't know. I'd really like a good, object oriented language, with a mix of dynamic and static typing, support for mesh tuple spaces, event queues, first class functions and relations, orthogonal persistence, which runs fast enough, and generates portable, high quality visualisations and information displays. None of the options are quite there yet, but XULRunner's JavaScript and SVG seem to be the most sane.

Pete

Labels: comweb, javascript, programming, web, xul

The curse of the golden flower

This juxtaposed with this.

Jing ke ci qin wang, Hero and The Curse of the Golden Flower, seems the Chinese authorities like supporting films about an invincible leader. I got tired of eye candy, and the massacres weren't entertainment. Wipe them out and continue as though nothing happened. Erase democracy from the Google cache.

Pete

More rdf bits

I've spent a little more time on quad-store, and found some more realistic performance figures from Franz Inc and the ESW wiki.

There's now a reasonably fast in-memory sorting (using radix 256), around 2.2s/10 million triples, which should help indexing small stores. I also have had a look at MPI for linking nodes together.

I think the premise may well hold - as the limiting factor for growth is IO and memory rates, not CPU, so trade CPU for compression and limit the number of values a store can hold, so making the store's representation smaller, reducing IO. Structure for locality and divide the store into nodes when the limits are exceeded. I still haven't found a large enough dataset to push the 2³⁰ values it can hold. And 2³⁰ values is up to 2⁹⁰ triples, as each triple has 3 values. 2 bits of the 32 are reserved to indicate whether it's a compressed URI, or a hash table URI, a blank or an literal.

OpenLink Virtuoso also seem to be using something similar in their 'bitmap indexes', though limiting to 32 bit values gives you a set of sorted b-trees of blocks of quads, rather than indices - in a pure index you have to look up the value, so incur an indirection.

I'm not yet bored with playing with this, as I am learning quite a bit about how bandwidth limited systems behave, whereas most of my previous work has been cpu limited numerical stuff, distributed services, or km and visualisation apps.

Pete

Labels: hpc, programming, quad-store, rdf

MegaData-QuadStore micro benchmarks

2007-04-05 21:36
Starting from Joe Gregorio's BitWorking post on 'megadata', I'm wondering what happens if you work this thing from the other end - rather than trying to make a database that scales well over distributed compute nodes, instead focus on optimising each node whilst trying not to compromise external scalability.

Getting parallelism working, whether via network or just concurrent threads in the same machine, is harder than straight-ahead computing in a single node.

Since my background is in engineering computing and simulation (with a little KM and recreational prolog), I've little idea why it takes 15 minutes to load 10,000,000 triples into a RDF store , whereas reading a 150 MiB file itself takes seconds.

So I'm trying to find the cost points for nodes, rather than looking at distribution schemes here.

For this, I'm taking a node as a virtual processor and half a GiB RAM, partly because my laptop has a single non-hyperthreaded core and 640 MiB RAM, so is a good model for a single node in a server array. A quad core hyperthreaded server with RAID should be eight nodes -eight times my laptop in Mips, disk speed and RAM.

At that size, a node doesn't have to handle more than a billion tuples, and I'm assuming a little bit that some of the strategies for distributing queries over nodes will still apply to such nodes. Limiting a node to a billion tuples also makes coding easier - you can use 32 bit identifiers, as long as they are reasonably local to the node. By 'reasonably local', I'd expect some number of IDs to be common to all nodes to represent the most frequently occurring term values, but distributed queries not to have to rely on having unique IDs that span between nodes.

Most of the RDF stores seem to be written in Java, and what I've seen of the implementations do a lot of pointer chasing, so this is a bit of fun to see what we can do with SSE2 to make a RDF-like query engine without pointer chasing.

If each tuple in the store is a quad vector of 32 bit ints, we can do quite fast searches using SSE2. As John Sowa predicted, many triple-stores have found that triples aren't enough, and are quad stores now. Though maybe a hexa-store is best for human-oriented data, as most of the relations we come up with have arity less than seven.

The log of my first week's hacking is available here.

Importing from the 1716 MiB SQL dump mentioned in the BitWorking post's comments, using a compression schema and a string table to generate a 160 MiB table of around 10 million triples and keep the full form of each quad within the memory limits takes around 2 minutes, around twice disk throughput speed.

Getting that far took most of Good Friday and Low Saturday, then I went to my sisters for the rest of Easter.

Coming back and coding in the evenings this week, I was playing with merge sort on the quads. Sorting one term over a 10 mega-quad table takes around 9 seconds, which is around 24 passes with merge sort, so visiting 28 million quads a second at ~85 clock cycles per quad (clock speed is 2.4GHz). That does seem a little high, as there's only around 20 instructions in the lowest level loop, though there is a branch. Using SSE to eliminate the branch costs more than the simple C variant.

Thinking about indexing next, it's noted here that
to generate indices where all quads matching two terms are contiguous, you need an index for each selection of two terms from four, ie six indices.
Let P, O and G denote the four terms in the quad - subject, predicate, object and graph.
For one or three term queries, four indices suffices - for one term, one index for each of SPOG, for three terms, one index sorted on the three terms in any order.
For a query of four terms, any index will have the matches contiguous.
So for quads, six indices will give you an O(log(N)) match on any single part of a query, as long as SPOG appear first and last in at least one of the indices.

So a selection such as these might be used:


 SPOG -> S, SP, SPO matches contiguous, SP matches ordered by O over G.
 POSG -> P, PO, matches contiguous, PO matches ordered by S over G.
 OSPG -> O, SO, matches contiguous, SO matches ordered by S over G.
 GSPO -> G, SG, SPG matches contiguous
 GPOS -> PG, POG matches contiguous
 GOSP -> OG, SOG matches contiguous

Based on the assumptions I've made about what a node in a distributed array needs to provide, ie 32 bit IDs is enough, and how well the sample quads compress, these don't have to be indices - a reference into a file takes as much space as the data itself - instead they are the IDs. I'm using these for direct matches, not comparisons such as SPARQL filter provides; you'd filter a potential, or create a union of a small range (such as 2 < x <= 5 translates to x=3|x=4|x=5 for integers).

I'd expect that most of the time you'd want the results around a single subject, particularly if you have lots of blank nodes, ie you'd be using one ordering to get ?PO_, then use SP?_ (? desired term, _ don't care, SPOG specified value).

Taking the example query from http://xtech06.usefulinc.com/schedule/paper/18 :


SELECT  ?pub ?title ?issue ?article
WHERE {
  ?title    rdf:type        struct:Title .
  ?title    dc:identifier    .
  ?issue  prism:isPartOf   ?title .
  ?issue  prism:volume  "20" .
  ?issue    prism:number    "4" .
  ?article  prism:isPartOf   ?issue .
  ?article  prism:startingPage  "539" .
}

An optimiser may first look at the POGS and OGSP orderings to determine the spans of the predicates and literals to determine which patterns to match first.

Either you can build a topological sort and start with finding ?title candidates, then ?issue candidates, then ?article, or you can merge the results of all matches with one ungrounded term.

The thing about this index approach is that this pattern:

  ?title    rdf:type        struct:Title .

would use the POGS index, and the result would be far from the result for

  ?title    dc:identifier    .

This lack of locality, I think, is going to be a pain point for this kind of normalised store.

Whether it's an index with pointers into a larger store, or an ordering of a compressed form of the quads, you've got a cache (disk or memory) miss, which is why I'm trying to implement this without chasing pointers.

But so far I'm happy that it's possible make a midi-store node that's good for a billion quads, and then distribute a mega-store between them using a higher level protocol than that which is optimised for the node to use internally.

TME

Labels: programming, quad-store, rdf

Swimming the Atlantic

Link: http://maps.google.de/maps?f=d&hl=en&saddr=newport,+wales&daddr=Newfoundland,+Canada&layer=&sll=44.840291,-40.605469&sspn=73.415538,119.355469&ie=UTF8&z=3&ll=46.920255,-34.628906&spn=71.370804,119.355469&om=1

The strange thing about google map's directions from Newport, Wales to Newfoundland, Canada is that, like Bern to New York, you depart for America from Quai Frissard in Le Havre, swimming, whereas you just continue with a slight right turn over the channel.

TME

Gpu and meta programming

This last week I've been looking a bit at some of the gpu toolkits that I mentioned last week. Mostly that has concerned getting used to building them, since my home computers done have high enough spec graphics cards to run them. I'm getting tempted to buy a shuttle to play with, but like using my laptop more than a desktop, and it seems a waste buying a top-flight graphics card then interacting with it via a laptop and ssh (which is how I often use my Mac).

I've also started porting a large program from Windows to Linux so I can operate on it using oink static analysis tools and Mozilla's scriptable metaprogramming tools. Actually it only needs to preprocess on gcc rather than be ported, but it would be nice if I could run it at home.

Also this got me thinking about how to do an efficient triple-store.

TME

(removed link to chocolate bunnies pic)

Labels: gpgpu, metaprogramming, programming

Language of the year

For me, this year's language is assembly. I did a little 6502 assembly as a teen - wrote a dissembler for the BBC B's OS and mucked around with its interrupts in high school and some 6800 during university, and did a tiny bit of 386 inline assembly for loop optimisation of Gouraud shading parts of my MSc in 1992, so it's not quite a new language, but it's one I haven't used in a few years.

It seems that there is more assembly around these days - some marginally disguised such as C++ SIMD intrinsics, but quite a bit of the lock-free code I've been seeing is assembly. Perhaps I'm just looking at more parallel and concurrent code. It seems the high level languages we currently have don't have abstractions which match multi core architectures, and so we have to make our own low-level libraries to map them to them.

There's also big gains in using GPUs for parallel computation, which is again getting closer to the hardware, though there are C and C++ tool kits for GPUs out there. The last few years I've looked at languages which have been higher - Python, Common Lisp, ML, JavaScript. Fortress stands somewhere else - it's a very high level language, but very careful to have a rich enough type system to allow a strongly optimising compiler for parallel hardware. I still don't believe that type systems have to be static to be optimised, but they do need to be strong.

Targeting these new architectures, the PeakStream platform looks interesting - though it seems to only target workstation GPUs, and may not be a platform (which I take as synonymous with virtual machine) but rather a set of GPU optimised libraries. As my only home machine is a laptop without I don't have machines which have its target GPU or one for NVIDIA CUDA. The Sh shader metaprogramming library also looks like the sort of thing that may be useful - it uses C++ operator overloading, much like Boost lambda or RealLib to generate an AST of the expression, which is then calculated on either a CPU or a GPU backend (RealLib expressions are calculated using the CPU, sometimes with SIMD intrinsics, to required precision). I'm interested as to how Fortress' parallel for might get implemented on a GPU, and whether it's still necessary to detect glitches in gamer rather than workstation class GPUs - traditionally the gamer class GPU's ran hotter so weren't guaranteed to produce accurate results all the time, but there's no mention of that in recent generations.

This post is delayed from Thursday as I was off in York, finding a venue for my wedding reception. My lovely lass stayed with me for a fortnight, but now she's back in Glasgow :(.

TME

Labels: assembly, concurrency, fortress, programming

Tin Can Camera

2007-04-26

Rich Application Platforms

2007-04-23

The curse of the golden flower

2007-04-20

More rdf bits

2007-04-12

MegaData-QuadStore micro benchmarks

2007-04-10

Swimming the Atlantic

2007-04-06

Gpu and meta programming

2007-04-01

Language of the year

About Me

Previous Posts

Blog Roll

Archives