2007-11-01

One more wide-finder

Since the previous efforts didn't print the results summary, they weren't really complete; so here are two more variants, one using a vector (tbrayA.cpp) and one using a priority queue (tbrayB.cpp) to get the top ten entries.
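The priority-queue approach in the B variant boils down to a bounded min-heap: push every count, and whenever the heap grows past ten, pop the smallest. Here's a rough sketch of the idea (illustrative names only, not the actual tbrayB.cpp code):

```cpp
#include <algorithm>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Pull the n largest counts out of a count map using a min-heap;
// the smallest of the current top-n sits at the top, ready to be
// evicted when something bigger arrives.
std::vector<std::pair<int, std::string> >
top_n(const std::map<std::string, int>& counts, size_t n) {
    typedef std::pair<int, std::string> entry;
    std::priority_queue<entry, std::vector<entry>,
                        std::greater<entry> > heap;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it) {
        heap.push(entry(it->second, it->first));
        if (heap.size() > n)
            heap.pop();  // evict the smallest of the candidates
    }
    std::vector<entry> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());  // largest first
    return result;
}
```

This is O(k log n) for k distinct keys, against O(k log k) for sorting the whole vector as the A variant does; for top-10 out of many thousands of keys the heap does less work.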

There's a pthreads parallel version of the B variant (tbrayB_parallel.cpp). The main thread reads chunks, scans back to the last line break, and hands each chunk off to a pool of worker objects, each of which runs the matching in its own thread. Once all the data has been read, the workers' counts are summed and the top ten counts found, then the results are printed as in the single-threaded version. Each worker's data buffer is written only by the main thread, for the most part in the call to the OS's read function; this helps reduce the amount of data copying. Whether the T5120 would have to copy anyway I don't know - some CPUs share cache between cores, others don't, and those that don't may have to copy the buffer from one cache to the other when the matching thread starts processing.
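The scan-back-to-a-line-break step can be sketched as follows (a hypothetical helper, not the article's actual code): after read() fills the buffer, find where the last complete line ends, hand that prefix to a worker, and carry the tail over into the next read.

```cpp
#include <cstddef>

// Given a buffer just filled by read(), return the offset where the
// last complete line ends. Bytes [0, result) are whole lines and can
// be handed to a worker; bytes [result, len) are the partial line to
// copy to the front of the buffer before the next read.
size_t chunk_end(const char* buf, size_t len) {
    size_t i = len;
    while (i > 0 && buf[i - 1] != '\n')
        --i;
    return i;
}
```

If no newline is found at all the whole chunk is carry-over, which only happens when a single line is longer than the read buffer.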

The code hasn't been built on anything other than Linux this time, so I can't guarantee it will run on Solaris. There's also a deadlock in the scheduler if you only use two workers (you need at least two so that each buffer is only being written or read by one thread at a time), which I have yet to track down.
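To make the hand-off concrete, here is a minimal single-worker sketch of the pattern using a mutex and condition variable (names and structure are illustrative; the real code cycles buffers through a pool of such workers, which is where the scheduler - and the two-worker deadlock - comes in):

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <cstring>

// One worker: the main thread fills buf and signals ready; the worker
// scans the buffer on its own thread. buf is written only by the
// main thread, as in the article.
struct Worker {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t ready;
    const char* buf;
    size_t len;
    bool has_work;
    bool done;
    long line_count;
};

void* worker_main(void* arg) {
    Worker* w = static_cast<Worker*>(arg);
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->has_work && !w->done)
            pthread_cond_wait(&w->ready, &w->lock);
        if (w->done)
            break;
        for (size_t i = 0; i < w->len; ++i)  // stand-in for the matching
            if (w->buf[i] == '\n')
                ++w->line_count;
        w->has_work = false;
    }
    pthread_mutex_unlock(&w->lock);
    return 0;
}

// Drive one worker through a single hand-off, then shut it down.
long count_lines_via_worker(const char* data) {
    Worker w = Worker();  // zero-initialise the flags and counters
    pthread_mutex_init(&w.lock, 0);
    pthread_cond_init(&w.ready, 0);
    pthread_create(&w.thread, 0, worker_main, &w);

    pthread_mutex_lock(&w.lock);
    w.buf = data;
    w.len = std::strlen(data);
    w.has_work = true;
    pthread_cond_signal(&w.ready);
    pthread_mutex_unlock(&w.lock);

    pthread_mutex_lock(&w.lock);
    while (w.has_work) {  // crude wait for the worker to drain the buffer
        pthread_mutex_unlock(&w.lock);
        sched_yield();
        pthread_mutex_lock(&w.lock);
    }
    w.done = true;
    pthread_cond_signal(&w.ready);
    pthread_mutex_unlock(&w.lock);
    pthread_join(w.thread, 0);
    return w.line_count;
}
```

With a pool, the main thread must wait for a free worker before reusing a buffer; get that wait wrong and the main thread and workers can each end up waiting on the other, which is roughly the shape of deadlock described above.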

Obviously, the multi-threaded version requires quite a lot of infrastructure - it's twice the size of the single-threaded version. It also runs significantly slower than the single-threaded version:

fortinbras:$ time bin/tbrayB_parallel datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m1.358s
user 0m0.372s
sys 0m0.232s
fortinbras:$ time bin/tbrayB datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m0.531s
user 0m0.232s
sys 0m0.156s

Commenting out the call to process_chunk(), so that only the fork-and-join thread communication remains:
fortinbras:$ time ./widefinder datasets/hundred-o10k.ap 3

real 0m0.288s
user 0m0.056s
sys 0m0.220s
So there is some additional cost to the parallelism beyond the inter-thread messaging alone (372 ms of user time, against 232 + 56 = 288 ms).

There's an archive of the code here; unpack it, run make, and call ./widefinder with the path of the logfile and the number of worker threads (minimum three). The other examples compile into the ./bin directory via make tbray.

I'm still not overly convinced that the task will be CPU bound on a small, slow, miserable disk, unless you are using a language whose virtual machine is excessively baggy. I've been careful not to use any optimisation that is specific to the data, rather than something a compiler for a higher-level language would know, given suitable annotations for parallelism.

The latest Erlang effort from Pichi is twice as fast as the original Ruby; Ruby takes 2.580s of CPU time on my machine, so the single-threaded C++ version is 11 times faster than Ruby on one core, and the parallel one 7 times. So once you have half a dozen cores, the Erlang should beat the single-threaded C++ code. Though I'd like to see whether the disk on the T5120 makes it CPU limited or IO limited with the faster code, and I'm unconvinced that the T5120 is what the future will look like - I'd expect more cache and flash on chip rather than more cores, and a multi-node rather than a multi-core future.


Pete
