I've been thinking a bit about garbage collection and stack allocation, and both Robert O'Callahan and phaeron at virtualdub seem to have been thinking along similar lines.

Chicken Scheme does something interesting - it uses a single stack which is collected on exhaustion (rather than the traditional program stack plus a separate stack-based allocator of the C# or Java VMs), so I'm thinking about implementing a tail-call-elimination variant of the single stack. That requires escape analysis, which may be costly - in its simplest form, it requires a flag to indicate whether any reference was made to anything on the stack above the point the tail call returns to. Either you can be conservative and set the flag on all function calls where a reference to a stack object is passed as an argument, or you can use a single-assignment model and test whether anything was allocated. Making the core single-assignment would possibly complicate supporting other languages, and means you trade space in one condition - consing to hold the assigned value - for space in another - being able to recover some of the stack early in a tail call.

Having just the one stack means that, on x64 at least, you can use the one stack register.

Hmm. These things are all rather hard to do right, and don't have simple 40 line benchmarks.

See also a simple worked example.




I hate everybody

Well, every programming language I've ever used anyway.

Not a lot to report. I've faded off of doing the wide-finder thing, as the MPI Ruby code was about as good as I could expect in elegance, and I'm more interested in other stuff at the moment. It did get me annoyed with C++ again, in that its primary mechanisms for specialisation - template metaprogramming and single-dispatch polymorphism - aren't quite a good fit for the problems I want to solve: the boost::regex version was much slower, and shouldn't be. I want good enough performance, and clarity, and open objects when developing, and strongly typed interfaces when building libraries, and an extensible grammar, and access to the low-level machine when bit-twiddling (tbray9.cpp uses a better 8-wide '\n' scan, adapted from this zero-in-word hack; parallelising it with pthreads or MPI is left as an exercise for the reader). I don't want to have to use unhygienic macros; and this discussion of units and measures in Java's verbose, strict, non-expressive generics made me think about kin, my pet lisp+prolog+strong-types language, again.

So most of this week I've either been thinking about how to make a fast VList-based stack work for a fork+join worker model, or playing with a higher-level language which looks a bit like Python with strong type annotations - I prefer infix syntax, Dylan has always looked a bit weird to me, and I don't want JavaScript's C-like braces. Python got a bit interesting when it looked like it was drifting towards lisp-land, but then drifted away again.

One thing that struck me reading the Dylan manual (it's always worth looking at how other languages do things) was its convention that
    foo.bar
and
    bar(foo)
were the same thing. If that was combined with a strong form of memoization such that writing
    bar(foo) := baz
guaranteed that in future
    bar(foo) == baz
you would get the ability to define dynamic fields in objects, similar to what JavaScript and other associative object based languages offer, as they could be written in the dot syntax as:
    foo.bar := baz
    foo.bar == baz
Assuming type specialisation on the function trace path is also cached, lookup becomes a single hashtable access. You do lose locality; if you want performance you'd have to add it as a field on the object's class, though I hate that - language features shouldn't be used for indirect effects; type safety and performance optimisation should be orthogonal. And lack of locality saps performance, though it depends on context: if you're iterating over lots of objects accessing the same field, the locality could be better than with a conventional C++ layout (different fields in the same object local, different objects anywhere) or a JavaScript one (different fields in the same object possibly local, using a hashtable or similar).

As an aside, recursively rewriting the slot accessor form:
    x.f.g
    == (x.f).g
    == f(x).g
    == g(f(x))
This, although apparently missing from the DRM, reads in the opposite order to the normal mathematical interpretation, (g o f)(x) = g(f(x)). Pity. Now I also hate Dylan.

My more recent ramblings about kin are on kintab.net; I've thought a bit about what I'd need to do units and measures nicely, and have been going through SICP coding up what the examples would look like in infix notation, which is a useful test for conciseness and clarity.




A few more thoughts on concurrency and high-level languages

A little while ago, I started playing with a pattern-matching and relational language for meta-programming called 'kin' (it's got relations, and does pattern matching, so kin/kinship).

Being a machine elf rather than a language lawyer, I never could come up with a decent user syntax for it. I've toyed with making it a lisp (either a Lisp-1 like Scheme or a Lisp-2 like Common Lisp; I wrote something that used lisp s-exprs to manipulate JVM class files, but never got further, and then Java 1.5 came out and increased the verbosity of the type system without increasing its power significantly, and I moved on), and with making it a prolog derivative (I rewrote the core of a Java WAM engine so it ran a few times faster, but never got anywhere much with that either, apart from looking at it and thinking, 'yep, it's faster now, so what do I do with prolog?'). Then I thought about a faster JavaScript engine, but Tamarin's handling that well enough for most use-cases, and I want better type inference rather than having to tell the compiler everything. Tamarin also makes the mistake of relying on programmer-specified types to give performance; usually the type can be inferred if you have the whole system available, and even if not, the Self and StrongTalk VMs show you don't need explicit types to help the compiler. Type systems should help programmers, not compilers. But the JavaScript story doesn't really matter - the XULRunner front end I spend half my time on at work spends 90% of its time in Mozilla's SVG rendering code rather than in my JavaScript anyway. I write my best software when there are users asking me to, and when it's also a problem which interests me; without users I get bored too quickly, without interest I don't give more than the users expect.

I've spent a few hours poking around the Rubinius code; they appear to be making a ruby interpreter very like lisp interpreters were a couple of decades ago. I don't program ruby; I do heavy math backends and light front ends (with occasional forays into OpenGL), and never touch an RDBMS, so I use C++, Fortran and Java backends; C++, Java, XSLT middleware; and C++, Java, JavaScript front ends (depending what level of 3D visuals they require; I haven't had cause to look at canvas3D yet, so the OpenGL work has always been in C or Java, though given the current Gecko SVG performance I've gone off basing a front end on Mozilla code) - though I'd quite happily never use Java for a UI again (last time I got RSI from all those anonymous inner classes). I've been known to use lisp to prototype math code before coding it up in C++ for speed (though that's mainly been due to working for BAE at the time, who take 6 months to buy software for a 3 month project, or due to other business constraints on the implementation language, and isn't a reflection on professionally produced lisp compilers, which are fast enough not to require porting to C++).

An old housemate and good friend of mine did his PhD on an embedded Erlang system; that's as close as I came to it before a year or so ago, when I started using XMPP at work and tried installing ejabberd. It didn't work, and I gave up after a few tries, so I tested our server's federation code against Wildfire instead. That's nothing to do with Erlang; ejabberd appears to scale well in independent tests, and I suspect the Windows installer wasn't a priority (our customers require Windows). I've read comments that there are smart people who have written good software in Erlang. This is true. Smart people also write good software in BASIC or COBOL or Ada; that's because smart people write good software. Writing a language that smart people can write good software in is easy; writing a language that average people write non-crap software in is far harder. In my opinion, for this sort of problem, ruby seems such a language (well, it seems much better than Perl or PHP, and a bit better than python), but the implementation is pants. Conversely, ejabberd scales to many more users than Twitter can, and I suspect is a cleaner expression of the problem, because IM is a concurrency problem and erlang far exceeds ruby's capabilities at concurrency.

Here's some good advice on how a high level language should express a problem in something I started reading earlier:
"The structure of the program should exactly follow the structure of the problem. Each real world concurrent activity should be mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program we say that the program is isomorphic to the problem. It is extremely important that the mapping is exactly 1:1." -- Joe Armstrong (PhD thesis, Dec 2003).

You have a log file. You scan each line (in any order) for a hit and accumulate the number of times each hit occurs. Then you list the hits in decreasing order of the accumulated hit counts.

I don't see any concurrency intrinsic in the problem. The only reason we're even thinking about concurrency is that we want the implementation of the solution to be parallelised over many cores for performance, and (since SIMD isn't sufficiently general to scale much beyond 4-wide paths outside GPUs) many concurrent cores are the only thing we can think of that will give computers more throughput. For this problem, concurrency is as much an accident as memory management. Languages which handle memory management automatically have already eclipsed those which require manual memory management in most areas where it isn't an essential part of the problem. As concurrency is not an essential part of this problem either, no language where the concurrency is manifest in the implementation of the solution will be following the structure of the problem. Admittedly, I grew up on BASIC and 6502 assembler in my teens; if you teach kids pi-calculus in high school, the next generation might see the problem differently.

Neo: What are you trying to tell me? That I can dodge bullets?
Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't have to.

With Erlang you can express a problem in terms of concurrency.
It may be a long time before ruby's ready.




For this particular problem, the principles are more important than the exact results, but I'm happier with tbray7.cpp, which gives the same results as the regex in the original ruby code:
fortinbras:$ time bin/tbray7 datasets/hundred-o10k.ap 
matches: 95400

real 0m0.439s
user 0m0.284s
sys 0m0.144s
The file was in cache for that one, so ignore the real time; the important thing is that the user time wasn't adversely affected by getting the 'right' answer, instead of 1100 matches.




Of course, it helps if you tell gcc to generate a 64 bit executable

... when you're using 64 bit operations:
tercel-2:$ time bin32/tbray4 datasets/hundred-o10k.ap 
user 0m5.644s
tercel-2:$ time bin32/tbray6 datasets/hundred-o10k.ap
user 0m4.048s
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
user 0m5.725s
tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
user 0m3.839s
I'd assumed it would default to 64 bit as it does on linux, but no. Which also explains why I was confused about sizeof(size_t) being less than sizeof(long long) here.

It's still around 8 clock cycles per byte - the 5% gain is not enough to alter yesterday's estimates, which is a little encouraging actually, as I wasn't sure whether or not 32 bit Sparc instruction generators, such as GNU lightning (used in Rubinius) or whatever erlang's HiPE uses, would cause a significant slowdown.




Some fingers in the air.

I got Steve Vinoski's 2007/09/29 erlang code, installed hipe and the bfile module, and it ran on the laptop:
fortinbras:$ cat ../datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time erl -smp -noshell -run tbray5 main 512 ../datasets/hundred-o10k.ap
110100 matches found

user 1m23.649s
real 1m33.683s
sys 0m1.620s
I'm not sure, looking at either mine or Steve's code, where the 1101st match comes from - there are #ifdefs in my line-splitting code to print the lines, and if you diff that output with the input they're the same; and if it were related to the last line in the sample it would give a difference of 1, not 100, for the hundred-times-repeated file. But something's inconsistent between the two somewhere, and also with the original ruby code, which gives 954 matches for o10k.ap, or 1097 for the regex %r{GET /ongoing/When/([^ .]+) }.

From the gnome-panel dials, the erlang isn't IO bound in total on this machine - for the first few seconds it runs at max IO and 90% CPU, then for the remainder 99% CPU and zero IO, so it's reading the whole file into memory and then spawning processes to scan it. I'll leave it running overnight on the 10 million line file to see how it fares when it can't fit the file into memory, though I doubt that has much bearing on what would happen on the T2 box, as that has plenty of RAM.

[Update: It took 52 minutes 26 seconds, and was doing disk IO throughout, but that could well have been paging rather than the IO it has to do. Nothing to conclude, other than that it doesn't scale linearly - a 10 times bigger file takes 34 times longer.]

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time ~/projects/quad-store/bin/read_bench datasets/hundred-o10k.ap > /dev/null
real 0m8.754s
user 0m0.000s
sys 0m0.180s
So the 64 bit-wide matcher on the AMD64 laptop is very IO bound; the difference between the total time of just doing the IO and scanning the lines is negligible.

At 2200 MHz, it's scanning 201 MB data in 0.284s, which is 2200*0.284/201 = 3 cycles per byte processed.

Sun's Thumper gives 2GB/s transfer into memory; the T2 runs at 1.4 GHz and assuming the same IO rate, we have 0.7 clock cycles per byte delivered to play with. 64 hardware threads would give about 44.

Running read_bench on one of the 'ancient creaking' Netras, gives:
tercel-2:$ time bin/read_bench datasets/thousand-o10k.ap 
real 1m15.449s
user 0m0.362s
sys 0m27.998s
That's IO bound, 2,095 MB/75.5s = 27 MB/s. The laptop gets a similar transfer rate figure on the same test.

The T2 is 200 times the CPU, and the Thumper 80 times the disk transfer rate; I'll assume that the T2 system's IO is comparable to the Thumper.

tercel-2:$ time bin/read_bench datasets/hundred-o10k.ap 
real 0m6.159s
user 0m0.058s
sys 0m2.834s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
matches: 110000
real 0m7.617s
user 0m4.054s
sys 0m2.504s
At 440 MHz, it's scanning 201 MB in 4s, which is 440*4/201 = 8 cycles per byte processed; it's IO dominated but not IO bound. Optimising the matching code would at best give a 20% improvement.

It's also using about 5 cycles per byte system time for the IO.

Since the Netra runs Solaris 10 and is a Sparc variant, and in the absence of better information, I'm working on the assumption that the 1.4GHz T2 would be close to it in terms of clock cycles per work unit, so to process the log files it would also need around 13 cycles per byte delivered. So either we'd need 13/0.7 = 19 threads, assuming each thread provisions another 0.7 cycles' worth of work, or run 64 threads and be able to afford code that's three times slower than the single-threaded variant due to concurrency management. If it's closer to the fat-piped AMD64, then only 6 cycles per byte and 8 threads would do.

Guessing even more widely, with a large dataset and large-grain concurrency, the C++ matching code should be at least three times faster than the IO - using 8 to 20 of the 64 threads. The erlang interpreter seems to be around 10 times slower, so would require 80 to 200 threads to balance the transfer rate [correction: only the CPU time is relevant here, so it's 83.649+1.620s vs 0.284+0.248s, which is 160 times slower, so would require 1280 threads to get the same throughput]. If the IO rate is less than 800MB/s and the erlang scales across all 64 cores, then the faster matching in the C++ won't give any advantage. I've no idea how good erlang is across cores - the only times I've seen it praised are for many (conceptual) processes on small numbers of cores; it's a concurrent language rather than a parallel one. Inter-core communication doesn't come free, and any message still needs either lock-based synchronisation or CAS/HTM retries; as it doesn't matter which actor works on the data, in a C++ implementation you'd use a fail-fast lock and try a different actor rather than blocking, so going lock-free would cost more than it would gain. AFAIK you don't have that option in erlang, and are stuck with the given queuing and scheduling mechanisms, though you probably could subvert them.

In computer science there are only three numbers - zero, one and many. If my estimates are anything to go on (and I wouldn't bet more than half a penny on any of them), on the target system you need many threads to solve this problem most efficiently.

Looking at the code for the bfile library, it shouldn't be too hard to move the 64-bit-wide string match into an erlang library, which seems more fun than fiddling with MPI to get the required thread count; but the required thread count is far fewer than the number of processes in Steve's code, so MPI or pthreads should be a better model than large-scale concurrency. But erlang may be to concurrency as lisp is to dynamism - I like lisp, but I haven't found any interesting problems to which I could apply it commercially under the constraints of the business; and even though I end up greenspunning in most systems I write, the subset of lisp-like features which gets implemented to give dynamic behaviour is tailored to the problem, and the rest of the code can be performance-optimised more easily.

On the other hand, my fiancée comes back into the country on Saturday and I'm supposed to have finished sending out our wedding invitations by then, and all this isn't really helping that to happen.



It's a glitch in the Matrix...

Link: http://www.hemmy.net/2007/09/28/wireframe-toyota-corolla/

Real wire-frame car