Data-wide wide-finder

Running a test which just breaks the input into lines, without the matching, takes 380ms user time on the 100x file, compared with 720ms for the non-MPI version with matching. So there is work in both parts of the string processing.

Parallelising the scan for newlines to work on 8 chars at a time gives a reduction to around 220ms.
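The 8-bytes-at-a-time scan can be sketched with the classic zero-byte bit trick: XOR each byte of a word with the target character, then test whether any byte became zero. This is an illustrative reconstruction, not the actual tbray6.cpp code, and the function names are mine.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if any byte of 'word' equals 'target'.
inline bool word_has_byte(uint64_t word, unsigned char target) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    uint64_t x = word ^ (ones * target);    // a matching byte becomes zero
    return ((x - ones) & ~x & highs) != 0;  // true if any byte of x is zero
}

size_t count_newlines(const char* buf, size_t len) {
    size_t count = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        std::memcpy(&w, buf + i, 8);        // avoids unaligned reads
        if (word_has_byte(w, '\n'))         // only hit words get a byte scan
            for (size_t j = 0; j < 8; ++j)
                if (buf[i + j] == '\n') ++count;
    }
    for (; i < len; ++i)                    // remaining tail bytes
        if (buf[i] == '\n') ++count;
    return count;
}
```

The branchy inner loop only runs on words which actually contain a newline, which for log lines of around 70 characters is roughly one word in nine.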

Restoring the matching, and parallelising the scan for the first character of the pattern, leaves the CPU running at 800MHz.

A quick google search tells me that sudo dpkg-reconfigure gnome-applets enables me to use the gnome panel applet to lock the frequency at 2200MHz, so with that set, I can compare results between the linear and SIMD parallel forms. Typical results are around 400ms better for the parallelised version on the laptop:
fortinbras:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000
real 0m10.420s
user 0m0.688s
sys 0m0.268s

fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s
This takes user+sys time from 950ms to 532ms; if this process were CPU limited, that's about as good as you could possibly get from running two cores.

Sorting out a few machine word dependencies so it runs on one of the Netra Solaris boxes (the Linux/Solaris code is at tbray6.cpp; I was assuming size_t would be 64 bit on all 64 bit OSes), the results are:
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000

real 0m7.593s
user 0m5.638s
sys 0m1.894s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000

real 0m5.915s
user 0m4.027s
sys 0m1.855s
Here we get a much more significant speed up - about 25% faster overall time, due to a different balance between CPU and disk.

These simple benchmarks indicate a few things to me:

Not all gains from parallelism require concurrent processes - more transistors can mean wider data paths as well as concurrent cores. If I were using the architecture specific SIMD registers on the AMD64, then the paths would be twice as wide again.

A slower CPU with a 'server class' disk outperforms a faster CPU with a 'consumer class' disk in this test, but shows up potential to parallelise the code.

I like playing with bit-twiddly code on the weekends. I don't like having to debug bit-twiddly code in mission critical applications for my customers to a tight deadline. The optimisations should be in the regex library, rather than complicating user code - for example, this code avoids reading past the end of the buffer only because it has a pattern longer than the word size.

I'm also wondering whether some kind of hash based wide-scan can detect two or more characters of a pattern well enough to give an improvement, rather than just a wide-scan for first one - there's probably only a couple of bits of information per character in those log files.


fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/tbray6.cpp | wc -l

It's hairy but it's not huge.



Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67GHz PowerBook, which is a twin-core G4. My use of my twin-core 1.42GHz G4 desktop petered out last year when I got a single core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but as it's a requirement that the code should be as simple as possible, and KMP doesn't actually help on modern CPUs as it's a jumpy algorithm, the third and simplest approach of just calling strncmp works as well, as far as using time to measure can determine (tbray3.cpp).
fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real 2m28.201s
user 0m10.193s
sys 0m4.752s
At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz, and 2200MHz speeds and stays at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.
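A stripped-down sketch of the memory-mapped approach, with the scan separated from the mapping; names and structure are my illustration rather than the actual tbray3.cpp.

```cpp
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count occurrences of 'pattern' in a buffer with plain strncmp.
long count_matches_in(const char* data, size_t len, const char* pattern) {
    size_t plen = std::strlen(pattern);
    long matches = 0;
    for (size_t i = 0; i + plen <= len; ++i)
        if (std::strncmp(data + i, pattern, plen) == 0)
            ++matches;
    return matches;
}

// Map the whole file read-only and scan it; returns -1 on error.
long count_matches_mapped(const char* path, const char* pattern) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    size_t len = (size_t)st.st_size;
    void* data = mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) return -1;
    long matches = count_matches_in((const char*)data, len, pattern);
    munmap(data, len);
    return matches;
}
```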

Running the ruby for reference, which does a bit more in terms of counting occurrences:
fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
real 1m20.744s
user 0m24.710s
sys 0m4.420s
Presumably Tim's disk speed to CPU speed ratio is higher; on my laptop the ruby is IO bound, and the CPU not maxed, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory mapped code, so changing to use block IO rather than memory mapped (at the cost of a 25% longer program):
fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real 1m6.780s
user 0m9.593s
sys 0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU time. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8, but if so that alone would account for the CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.
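The essential extra complexity in the block IO version is carrying a partial line over from one chunk to the next, so a match spanning a chunk boundary isn't lost. A sketch of that logic, with the chunk source abstracted away so it's testable (my illustration, not the tbray4.cpp code):

```cpp
#include <cstring>
#include <string>
#include <vector>

// Scan fixed-size chunks, carrying the unterminated tail of each
// chunk to the front of the next before matching.
long count_matches_chunked(const std::vector<std::string>& chunks,
                           const char* pattern) {
    size_t plen = std::strlen(pattern);
    std::string carry;
    long matches = 0;
    for (const std::string& chunk : chunks) {
        std::string buf = carry + chunk;
        // only scan up to the last complete line in this buffer
        size_t last_nl = buf.rfind('\n');
        size_t scan_end = (last_nl == std::string::npos) ? 0 : last_nl + 1;
        for (size_t i = 0; i + plen <= scan_end; ++i)
            if (std::strncmp(buf.data() + i, pattern, plen) == 0)
                ++matches;
        carry = buf.substr(scan_end);   // partial line for the next chunk
    }
    // final partial line, if the input doesn't end with '\n'
    for (size_t i = 0; i + plen <= carry.size(); ++i)
        if (std::strncmp(carry.data() + i, pattern, plen) == 0)
            ++matches;
    return matches;
}
```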

The C++ code is also fewer lines than Steve Vinoski's 84 line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definitely takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 will reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs, you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:
tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real 1m23.930s
user 0m57.070s
sys 0m24.636s
Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -j, or using MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to use as many processes as there are cores, but you can force it to use a specific number of processes. On a single core machine, it slows from 1.2 seconds on a million line file with a single process to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.

The MPI version loses in terms of conciseness, and has two separate code paths - the process to read the file into chunks, and those to scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably the error is passed to read-eval-print-loop to handle), so is within a couple of lines. Using pthreads would probably require more lines, as it lacks suitable message passing primitives.
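The reader process's job of carving the file into chunks for the scanners amounts to the following; an illustrative sketch only, since the real tbray5.cpp does this with MPI calls and byte buffers.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Split file_size bytes into near-equal (offset, length) ranges,
// one per worker, as an MPI scatter would. These are byte offsets,
// so each worker still has to resync to the next newline (or accept
// losing the few hits cut at chunk boundaries).
std::vector<std::pair<size_t, size_t>> partition(size_t file_size,
                                                 size_t workers) {
    std::vector<std::pair<size_t, size_t>> ranges;
    size_t base = file_size / workers, rem = file_size % workers;
    size_t off = 0;
    for (size_t i = 0; i < workers; ++i) {
        size_t len = base + (i < rem ? 1 : 0);  // spread the remainder
        ranges.push_back({off, len});
        off += len;
    }
    return ranges;
}
```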

Parallel languages, concurrent languages

IO aside, the actual processing seems to be a problem in parallelism - the task can be split up into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contribute to the counts, so the dataset for the reduction is significantly smaller than the input dataset, so parallelising the matching and maybe one pass of sorting and counting should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built in parallel for and reduction, equivalent to MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more complicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor model languages such as Erlang, Alef, and Scala are used where the problem is best expressed in terms of concurrent processes. Their implementation is designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce requirements for locking, have strategies to mitigate blocking operations stalling other actors' execution, and solve many of the issues that using OS-level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate, it doesn't require more processes than there are CPUs, it doesn't write anything so doesn't suffer from multiple writer threads and locking issues, and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than terms of the operations on the data.


This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound; using compression is one way to shift work load off the IO pipe and onto the CPU. It's also why I spent a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars/gigabyte for a solid state drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I was to write text extraction and reporting scripts, but hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks in the code. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one in 18 (if a character isn't in "GET /ongoing/When/" you can skip ahead 18 characters and check whether that character is in the pattern, which is the basis of Boyer-Moore style skipping); if the pattern matching can be done better than the line scanning, then that should give a boost on the CPU limited machine. It doesn't give a boost on the Netra, and I'm not sure why - possibly because the skipping makes branch prediction worse.
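Matching without splitting into lines can be sketched with memchr to skip to each candidate first character and strncmp to confirm the rest; only candidate positions get inspected closely. Function names are mine.

```cpp
#include <cstring>

// Count non-overlapping occurrences of 'pattern' in buf without any
// line handling: memchr finds each candidate first byte, strncmp
// checks the remainder.
long count_pattern(const char* buf, size_t len, const char* pattern) {
    size_t plen = std::strlen(pattern);
    long matches = 0;
    const char* p = buf;
    const char* end = buf + len;
    while (p + plen <= end) {
        // only search positions where a full match could still fit
        const char* hit = (const char*)std::memchr(
            p, pattern[0], (size_t)(end - p - plen + 1));
        if (!hit) break;
        if (std::strncmp(hit, pattern, plen) == 0) {
            ++matches;
            p = hit + plen;   // this pattern can't overlap itself
        } else {
            p = hit + 1;
        }
    }
    return matches;
}
```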

All the code posted has been exact solutions. If you split a billion line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want a 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether an approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.




sshfs and subversion

I just lost a longer post as I'd forgotten to plug my laptop in, was editing in SciTE rather than here, and the laptop powered off rather than hibernating. (I've changed the setting now - System>Preferences>Power Management, the 'On Battery Power' tab - though why the default action was to destroy data I don't know. Bad Ubuntu.)

Anyway, I've got sshfs working instead of samba on the laptop, as SciTE doesn't understand smb: urls, and installed subversion on lure, my little server, so I have a shared code repository between my machines. I also set up passwordless login so I can use svn+ssh to connect to subversion from both fortinbras, my laptop, and tercel-1, the Netra I've installed build tools on. I can now edit on either machine using the same tools, sync with svn, and build.

Now I might actually get round to doing something with the projects which have been sitting on my hard-drive for the last year or so.




Misc Admin - Laptop PCI, Bubba Static IP, More Disks

Laptop PCI Conflict

For a while, my laptop BIOS had been sporadically reporting a PCI conflict on boot, with no other symptoms.
It's dual boot Windows XP Home and Ubuntu, and after the latest Windows Update it lost its on-board network adapter, both in XP and Ubuntu. Rolling back the update fixed Windows, but somewhere some persistent state made the PCI conflict happen around half the time, and if it did happen Ubuntu would not see the card, or be able to recover.
The logs said 'try pci=assign-busses', so I did (in GRUB use e to try it once, edit /boot/grub/menu.lst to make it permanent), and the adapter reappeared and (so far) there have been no conflicts on rebooting. In addition, Ubuntu can now see the SD card, (presumably that is what it was conflicting with) which means I don't have to boot into XP to snarf the photos from my camera.
Now if only the wifi button was broken enough to appear in the logs and explain how to fix it...

Bubba and Static IP

On Friday my new excito Bubba home server arrived. I plugged it in, switched it on and all was good.

Mounting Bubba Shares

There was a little playing around to get Ubuntu to see the Samba shares, which I also put on the excito forum thread which gave some clues:
The server appears in Nautilus when you choose Places > Network, as ftp, sftp and is also under Windows network.
Browsing into the server, it requests user and password as expected, adds them to the keychain, then shows the files.
For a one-off mount:
fortinbras# mount -t smbfs //bubba/home /net/bubba/home -o username=*****,password=***** 

Following the instructions at justlinux.com for adding to /etc/fstab, the following works with mount -a when run as root:

//bubba/home    /net/bubba/home     smbfs credentials=/home/pete/.sambacreds,uid=pete 0 0
//bubba/storage /net/bubba/storage  smbfs credentials=/home/pete/.sambacreds,uid=pete 0 0
On next reboot, the mount points appear in Nautilus, but you can't mount them as a user.
To get mount working for users, you need to set suid root on smbmnt and ensure that the mount point is owned by the user doing the mounting.
I haven't tried enabling mounts for more than one account; if you add lines to fstab for each user, it adds icons for the mount points to everyone's accounts; trying to use ~ for the mount point and credentials doesn't seem to work either.

Backup of tincancamera.com

I couldn't find a recursive authenticated ftp client for the Bubba - its version of wget doesn't accept the --ftp-user arguments - so I've used wget with http to crawl the site, which gets everything except the access logs. I hadn't read the access logs, and they were putting me over the hosting's quota (they just grow rather than rotating as I'd expect them to), so I got them and deleted them off the server, so I'm using 3MB of the 5 I'm paying for rather than 11.

Static IP

I'd like the Bubba to be on the net, but I don't have a static IP. I'm with Pipex, who are fast and reliable if not the cheapest at the moment, but in addition to their stated charge of £1.25 a month require that my existing service be upgraded to a current plan (the current plans are 8MB instead of 1MB for the price I'm on), but would mean a new 12 month contract. I'm moving in three months, so that would mean paying for nine months I'm not here at £26.24 a month, which is silly - it would make adding a static IP for three months cost £79.97 a month, 1p less than twice what Pipex charge for a dedicated server. I've mailed them to see if they'll give me a static IP on my current £24.99 a month for a reasonable set-up fee, but haven't heard back yet.

More Disks

I'd bought three more 18.4GB disks for my Netras off ebay (Seagate ST318437LC x 3 for £29.89 including p+p), which arrived this morning. After popping down the road to AJM Micro for two dozen mounting screws, they are now installed and (following the guide here - docs.sun.com is nearly as slow as del.icio.us) each give a 16.7G zfs file system for use as a data area for mucking around with:
tercel-1# zpool create data c0t1d0
tercel-1# zfs create data/fs
tercel-1# zfs list
data 104K 16.7G 24.5K /data
data/fs 24.5K 16.7G 24.5K /data/fs
tercel-1# df -h /data/fs/
Filesystem size used avail capacity Mounted on
data/fs 17G 24K 17G 1% /data/fs




Thinking more about compression

I'm using compression on the URIs and literals in the quad-store to trade CPU for disk/memory bandwidth. The current algorithm is LZW, with a 24bit index look up table. Yesterday's benchmarks indicate it's not doing massively good compression; on the other hand it's small and I can see how to perform regular expression searches on the compressed data. 24bit index tables take 64MiB and give around 1:5 compression - the data reduces to about 23% of its size. You can't reset the table like you do in a streaming compressor, since you want the mapping of compressed bits to stay fixed. Bigger tables take more room on both counts - there are more entries, and the entries become larger as the number of bits to reference another entry increases. If you had 25bit indexes, you'd need 5 extra bytes per entry, so tables take 154MiB rather than 128MiB and possibly suffer due to cache line mis-alignment.
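The table sizes above can be sanity checked with a back-of-envelope calculation; the 4-bytes-per-entry figure below is an assumption chosen to match the stated 64MiB for a 24 bit table, not the store's actual entry layout.

```cpp
#include <cstdint>

// Dictionary size for an LZW table with 2^index_bits entries.
constexpr uint64_t table_bytes(unsigned index_bits,
                               uint64_t bytes_per_entry) {
    return (uint64_t(1) << index_bits) * bytes_per_entry;
}
```

At 4 bytes per entry, a 24 bit table is 2^24 * 4 = 64MiB, and a 25 bit table doubles the entry count to reach 128MiB before any per-entry growth.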

Each value in the quad is reduced to a 32 bit identifier in the store, and at least one bit is needed to indicate whether it's fully described by the identifier (uncompressing the identifier's bit-pattern gives the URL), or needs a look-up to a longer bit pattern. Obviously, the more URLs which can be squished into 32bit patterns the faster the store will be written to, as it removes the need to look elsewhere to match the string pattern of the value with one already in a hash table, to check that the same identifier is only ever used for the same string. But 24+1 bits leaves 7 unused bits of space in half the possible identifiers, which reduces the maximum identifier density to (128+1)/256 = 50.4%, which isn't very good.

Looking at the benchmarks here, there seem to be much better methods than LZW, but most of them are neural net or model based, and I can't see how to map a sequence of hints for a neural net into anything that could have regex implemented on top of it without decompressing. They are also orders of magnitude heavier on the CPU, apparently enough to wipe out the gain in total performance.

The other options are to not bother with the intrinsic identifiers - which means finding out how many of the URLs would fit in the LZW table for typical datasets (certain of the computer generated ones they all would) - or to find other uses for the spare bits, such as 11xx xxxx xxxx xxxx xxxx xxxx xyyy yyyy indicating a 23 bit LZW index and a 7 bit character. Maybe the simplest way would be to have aaaa aaaa xxxx xxxx xxxx xxxx xxxx xxxx where any non-zero in the a's indicates it's a hash key, so you get full identifier use at the cost of more hashtable lookups.
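The aaaa aaaa scheme in the last paragraph amounts to a one-byte tag test on the identifier; a hypothetical sketch (mask and function names are mine, not the quad-store's code):

```cpp
#include <cstdint>

// Top byte of the 32 bit identifier: any non-zero value marks a
// hash-table key; an all-zero top byte marks an intrinsic id whose
// low 24 bits are the LZW-compressed bit pattern.
constexpr uint32_t HASH_MASK = 0xff000000u;

inline bool is_hash_key(uint32_t id) {
    return (id & HASH_MASK) != 0;
}

inline uint32_t make_intrinsic(uint32_t lzw_bits) {
    return lzw_bits & ~HASH_MASK;   // keep only the low 24 bits
}
```

The test is a single mask-and-compare, so classifying an identifier costs nothing next to the hash lookup it avoids.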

Holding the full hash in memory isn't an option; even just a population bit would take half a gig. Since each node is single-threaded, it's possible to split URLs from literals and not care whether literals match outside of a smaller MRU table cache; that would increase search times for equality of literals with non-intrinsic ids, but not affect regex searches. It may also be worth maintaining separate LZW tables for literals and URIs. Eliminating duplications can be done in the background anyway.

I think the next experiment will be scanning the datasets I've got - swoogle's ten million triples, wikipedia in a couple of formats (dbpedia and the xml dump), and the uba owl generator - to see what hit rate and spread I'd get generating ids using LZW and hashes.



First compile and run on Solaris 10.

I've got a first run of one of my quad-store benchmarks working on my Solaris 10 boxes.

I needed to make the machine architecture, and a flag for whether or not to use SSE2, into environment variables, and add some extra #defines so none of the SSE stuff was included. I needed a macro to fake posix_memalign, though I will think about using another malloc-based version. Compiling on Solaris also generates warnings for automatic promotion of -1 to an unsigned int, so I added explicit casts (-1 is returned by some functions where 0 is a valid return and something is needed to signal an invalid value).

Running the compression benchmark, the sparc (440MHz) seems to be CPU bound even for small files (100MiB):

tercel-1$ uname -a
SunOS tercel-1 5.10 Generic_118833-33 sun4u sparc SUNW,UltraSPARC-IIi-cEngine

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 100
read: 100 MiB, compressed to: 21 MiB.

real 0m23.094s
user 0m20.973s
sys 0m1.715s

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 1000
read: 1000 MiB, compressed to: 301 MiB.

real 3m55.124s
user 3m39.377s
sys 0m12.550s

fortinbras$ uname -a
Linux fortinbras 2.6.20-16-generic #2 SMP Thu Aug 30 23:16:15 UTC 2007 x86_64 GNU/Linux
fortinbras$ time bin/lzw_bench datasets/original/triples.sql 100
read: 100 MiB, compressed to: 21 MiB.

real 0m8.638s
user 0m5.784s
sys 0m0.296s

fortinbras$ time bin/lzw_bench datasets/original/triples.sql 1000
read: 1000 MiB, compressed to: 301 MiB.

real 0m56.176s
user 0m43.667s
sys 0m1.948s

The processor in fortinbras is a Mobile AMD Athlon(tm) 64 Processor 3400+ at 2200MHz; the Netra's is at 440MHz, and appears to be about a quarter as fast at a fifth of the clock speed.

On the other hand, running the Netra at full speed doesn't heat my lap up, I've three of them, they have more and faster disks available, and I can play with making the quad-store distribute between them.

Whether they hold up as well against the benchmarks which use SSE to make the Athlon wider I'll have to find out.



Feed me

Due to a previous error in my blogger config (from when I moved it from blogger.com to tincancamera), the wrong atom feed was being associated with this blog in the page's head. I've corrected it, and you now should be able to subscribe to the proper feed using the icon in the address bar of your browser. Or click here.



Dynamic languages tools

Link: http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html

I got a CD with Komodo Edit on it at XTech, but hadn't tried it until now, prompted by the recent open-sourcing of the code and then this post appearing on Planet Mozilla.

It looks like it would be good for DSL development, and playing with it Komodo Edit seems to be very usable. My current editor of choice is SciTE, which Komodo Edit embeds, but Komodo Edit gives projects, and language aware autocompletion and navigation (go to the definition of a function). I haven't found out if it's possible to go to definitions in other files, which would be very useful - I spend quite a lot of time doing Ctrl-C,Ctrl-O,[filename],Ctrl-F,Ctrl-V,Enter to jump around a JavaScript project. It also does polyglot programming - a simple XUL file with a script element switches nicely from XML highlighting to JavaScript highlighting in the script element's content.

The user defined language tool is lexer + state machine based; I'd have preferred it to be a packrat based one, which I've played with before. But I don't want to have to write a whole IDE, I like using the Scintilla editor control, and Komodo's approach seems to get the job done.

As a side note, del.icio.us seems awfully slow these days. You'd have though that Yahoo would be able to give it a couple more servers. I wonder if the competitors are any better.



More concurrent languages

Tonight I watched Advanced Topics in Programming Languages: Concurrency/message passing Newsqueak, which introduced the Newsqueak language, later developed into the Alef and Limbo concurrent languages in Plan 9 and Inferno.

Not much to add to it - it seems a little too lambda-ey and not as cleanly calculus-driven as a functional language could be, and it doesn't have curly braces for the real-world crowd. Maybe it was just old.

JavaScript with channels might be nice; the generator/trampoline threading library could be extended for that; a concurrent process with a blocking channel composed with a reader, and a generator function composed with a reader function are very similar.

What Newsqueak had was a strong tail-call bias; having small stacks is good for lightweight process or transducer composition languages. In the various talks and papers I've seen, I still haven't seen much on using mobile processes to eliminate locks - if you can group the execution of transducers into threads so those which interact are serialised correctly, then you don't need to acquire locks when communicating over channels. I'd really like to be working on that, rather than current work (mainly UI work, as that's what needs doing on the project, with the implementation of access control in the XMPP server we're producing coming in the near future). The simulation system is based on a pub-sub engine, and each property is locked on read or write. Entities' behaviours are triggered by a global tick event, which cascades through a graph of transducers configured at runtime. When the tick completes, state is published to subscribing entities, which can trigger further transducer cascades. The UI elements interact using the pub-sub and request/response messaging over XMPP. Multiple UIs can attach to the simulation server, and the server can use services on other machines via XMPP.

Currently, each transducer object holds its own state and pointers to the channels to other transducers, and write events are implemented as virtual function calls using inversion of control. There's no recursion - values are not stored on the stack, and if you do tail-call you exhaust the stack within 300 events. The main tick process works in a single thread. The system also uses a TCP library which launches a thread for each connection, so you can end up with one thread doing lots of work and hundreds waiting for input. Since the simulation engine builds on a generic thread safe pub-sub system, there are quite conservative locking strategies, and many of the reads or writes to channels require acquiring locks.

I'd like to move them all to an actor model style system, which integrates with the TCP select, so there are fewer threads doing more work, and without the current call-stack limits. Getting a graph of what can read the pub-sub channels and only using locking where they need serialising across threads (or locking them once for use by all actors active in the same thread). Unlike NewSqueak, the system uses asynchronous buffered channels for request/response semantics, so there would need to be a mechanism for posting channel events somewhere; currently there is a single queue with locking - that would need to be changed to queues per thread, and then a mechanism to allow the queues to be shared out at the next sync point. Some of the behaviours (such as discovering a service and then opening a channel to it) would need to be synced across threads always; these would block the actor until the next sync point.

If you're only syncing threads at specific points (where you also can redistribute actors between your threads), you can get rid of the OS mutexes on channels and use single bits in an array - you're only asking whether a process uses a channel in the next operations, and you only need to ask it once. The actor can then wait if its channel usage changes, letting its thread process the other actors.

Maybe I should just give up and try to find an Erlang shop rather than having to reinvent it all in C++ for performance reasons.



(N+1)^2 = N^2 + 2N + 1

My mind drifted when sitting on the loo at work the other day, making patterns on the cubicle tiles...

(changed title to more clearly reflect the idea that adding a row to two adjacent edges then one tile creates a new square, which the animation shows if you're viewing in a SVG enabled browser, such as Firefox or Opera.)



Things I'll forget: Getting gcc, make and curl on Solaris 10

More admin stuff I'll forget if I don't record it:

Getting Curl

Curl is on the Solaris Companion Software DVD.

Having downloaded the companion DVD image, I mounted it on the Mac (double clicking the ISO image).

I then used Transmit to ftp the Sparc packages onto the 9GB install drive on tercel-1:/solimage/companion.

The readme for the DVD gives install instructions.

Mounting the DVD contents between the servers using NFS:
tercel-1# share /solimage/
tercel-2# mkdir /solimage
tercel-2# mount tercel-1:/solimage/ /solimage/

Installing the packages from the /solimage/companion/Packages directory:

tercel-N# pkgadd -d /solimage/companion/Solaris_sparc/Packages/ SFWcurl

Adding /opt/sfw/bin to PATH in ~/.bashrc means curl is on the path, and works when invoked - curl http://www.tincancamera.com gets the source for the homepage.

Finding gcc and make

You need . .bashrc in ~/.profile for .bashrc to be executed on login.

The additional paths I've got so far are:

/opt/sfw/bin/ for curl.
/usr/sfw/bin/ for gcc.

I wasn't entirely sure why curl is in /opt rather than /usr, but An Accelerated Introduction to Solaris 10 clarifies what the bits of the file system are for.

The make in /usr/ccs/bin/ is Sun make, not GNU make, which is /usr/sfw/bin/gmake. Add an alias to .bashrc:

export PATH=$PATH:/opt/sfw/bin/:/usr/sfw/bin/
export PS1="\\h\\$ "
alias make=/usr/sfw/bin/gmake

Currently I'm synchronizing these files between the tercels using ftp; I've a Bubba server on order so will move all my profiles onto that, since it's designed to be always on in a domestic setting.




Things I'll forget: Installing Solaris 10 on Netras

This is what I did to install Solaris 10 on my Netra servers.

Firstly, I downloaded the CD images and burnt them to CD using a Mac. I probably could have saved a little time and some plastic by mounting the ISOs on the Mac and accessing them over the net from the Netras, but there was enough new stuff to do anyway.

Referring to my previous shared memory entry and to Andy's Solaris disk faq I mounted a second 9GB drive on tercel-1 (my Netras are given the name 'tercel', being a small falcon, and numbered 1 to 3) to take the files for the install server. This was so I can move that disk to another machine after installing Solaris 10 on it, to use that as the install server for tercel-1.

I then followed the instructions for Creating a SPARC Install Server With CD Media and the following sequence to reboot tercel-3 from the net and install.

The install proceeds using my old Sony 505 laptop connected to the LOM port; I ssh into it and use minicom. I do have a Xyplex terminal server, but haven't ever got round to setting it up.

The first attempt at booting into the installer failed as I'd not selected VT100 as the terminal, so I got garbled screens. The second stalled: I left it and went to bed halfway through the file-copying stage, and something timed out and dropped the connection.

The third attempt succeeded, and tercel-3 booted into Solaris 10 from disk and was accessible via its LOM console.

I couldn't ssh into the root account, and since the last time I set up a Solaris box I'd forgotten how to set up user accounts (I wanted a clean install here too, having forgotten what I might have messed with a couple of years ago, so overwrote the previous settings completely).

To add user 'pete' with bash as login shell and create home directory in /export/home/:
bash-2.05# useradd -s /bin/bash -m -d /export/home/pete pete

To set password for user 'pete':
bash-2.05# passwd pete

Having done that, I could then login via ssh and change to super user using su as before.

I then brought all Netras down, and swapped the install server image disk from tercel-1 to tercel-3.

Using the previously mentioned procedure, I installed the Solaris image server disk:

tercel-3 console login: root
Last login: Sun Sep 2 14:17:12 on console
Sep 2 16:10:49 tercel-3 last message repeated 1 time
Sep 2 16:11:26 tercel-3 login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
# sync;sync;init 0
# svc.startd: The system is coming down. Please wait.
svc.startd: 83 system services are now being stopped.
Sep 2 16:11:54 tercel-3 syslogd: going down on signal 15
umount: /export/home busy
svc.startd: The system is down.
syncing file systems... done
Program terminated
ok setenv auto-boot? false
auto-boot? = false
ok reset-all
Resetting ...

Netra t1 (UltraSPARC-IIi 440MHz), No Keyboard
OpenBoot 3.10.25 ME, 1024 MB memory installed, Serial #14251450.
Ethernet address 8:0:20:d9:75:ba, Host ID: 80d975ba.

ok probe-scsi-all
Target 0
Unit 0 Disk IBM DDYST1835SUN18G S94A
Target 1
Unit 0 Disk SEAGATE ST39204LCSUN9.0G4207

ok setenv auto-boot? true
auto-boot? = true
ok reset-all
Resetting ...

Netra t1 (UltraSPARC-IIi 440MHz), No Keyboard
OpenBoot 3.10.25 ME, 1024 MB memory installed, Serial #14251450.
Ethernet address 8:0:20:d9:75:ba, Host ID: 80d975ba.

Boot device: /pci@1f,0/pci@1,1/scsi@2/disk@0,0:a File and args:
SunOS Release 5.10 Version Generic_118833-33 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: tercel-3
Sep 2 16:18:36 /usr/lib/snmp/snmpdx: can't open the file
Sep 2 16:18:36 /usr/lib/snmp/snmpdx: can't open the file
checking ufs filesystems
/dev/rdsk/c0t0d0s1: is logging.
/dev/rdsk/c0t0d0s5: is logging.
/dev/rdsk/c0t0d0s7: is logging.

I'm not sure whether that is actually necessary; the drives are supposed to be hot-swappable after all, but I don't know how to power a drive down and park it, or detect it without going to the LOM prompt.

Logging in and configuring and mounting the drive:

tercel-3 console login: root
Last login: Sun Sep 2 16:11:26 on console
Sep 2 16:21:13 tercel-3 login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
# bash
bash-3.00# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c0::sd0 disk connected configured unknown
c1 scsi-bus connected configured unknown
c1::dsk/c1t2d0 CD-ROM connected configured unknown
bash-3.00# cfgadm -c unconfigure c0::sd0
bash-3.00# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c0::sd0 disk connected unconfigured unknown
c1 scsi-bus connected configured unknown
c1::dsk/c1t2d0 CD-ROM connected configured unknown
bash-3.00# cfgadm -c configure c0::sd0
bash-3.00# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c0::dsk/c0t0d0 disk connected configured unknown
c0::dsk/c0t1d0 disk connected configured unknown
c1 scsi-bus connected configured unknown
c1::dsk/c1t2d0 CD-ROM connected configured unknown
bash-3.00# format
Specify disk (enter its number): 1
selecting c0t1d0: sol10img
format> part
partition> print
Volume: sol10img
Current partition table (original):
Total disk cylinders available: 4924 + 2 (reserved cylinders)

Part Tag Flag Cylinders Size Blocks
0 root wm 0 - 73 129.75MB (74/0/0) 265734
1 swap wu 74 - 147 129.75MB (74/0/0) 265734
2 backup wu 0 - 4923 8.43GB (4924/0/0) 17682084
3 unassigned wm 0 0 (0/0/0) 0
4 unassigned wm 0 0 (0/0/0) 0
5 unassigned wm 0 0 (0/0/0) 0
6 usr wm 148 - 4923 8.18GB (4776/0/0) 17150616
7 unassigned wm 0 0 (0/0/0) 0

partition> q
format> q
bash-3.00# mkdir /solimage
bash-3.00# mount /dev/dsk/c0t1d0s6 /solimage/
bash-3.00# ls /solimage/
Copyright lost+found
installer Solaris_10
bash-3.00# cd /solimage/Solaris_10/Tools/
bash-3.00# ls
add_install_client Installers setup_install_server
Boot miniroot_extra
dial rm_install_client

So we've got the Solaris 10 image server disk mounted and can list it.

bash-3.00# ./add_install_client -e 08:00:20:D9:A0:B2 tercel-2 sun4u
Error: unknown client "tercel-2"

I hadn't restored /etc/hosts, so it didn't know tercel-2.

I added the IP addresses for the machines on the local net to /etc/hosts, then tried again:

bash-3.00# ./add_install_client -e 08:00:20:D9:A0:B2 tercel-2 sun4u
Adding Ethernet number for tercel-2 to /etc/ethers
saving original /etc/dfs/dfstab in /etc/dfs/dfstab.orig
Adding "share -F nfs -o ro,anon=0 /solimage" to /etc/dfs/dfstab
making /tftpboot
enabling tftp in /etc/inetd.conf
Converting /etc/inetd.conf
enabling network/tftp/udp6 service
enabling network/rarp service
enabling network/rpc/bootparams service
updating /etc/bootparams
copying boot file to /tftpboot/inetboot.SUN4U.Solaris_10-1
bash-3.00# ./add_install_client -e 08:00:20:DA:09:C6 tercel-1 sun4u
Adding Ethernet number for tercel-1 to /etc/ethers
updating /etc/bootparams

Now it's time to reset tercel-1 and boot it off the net so it can run the installer, which requires swapping the serial connection over to its LOM. I do actually have a proper multi-serial-port-to-net box I could use instead, but have never got round to setting it up.

tercel-1 console login: root
Last login: Sun Jan 22 23:05:38 on console
Sep 2 15:25:50 tercel-1 login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.9 Generic May 2002
bash-2.05# sync;sync;init 0
INIT: New run level: 0
The system is coming down. Please wait.
System services are now being stopped.
Print services already stopped.
Sep 2 15:26:20 tercel-1 syslogd: going down on signal 15
The system is down.
syncing file systems... done
Program terminated
ok boot net - nowin
Resetting ...
Netra t1 (UltraSPARC-IIi 440MHz), No Keyboard
OpenBoot 3.10.25 ME, 1024 MB memory installed, Serial #14289350.
Ethernet address 8:0:20:da:9:c6, Host ID: 80da09c6.

Executing last command: boot net - nowin
Boot device: /pci@1f,0/pci@1,1/network@1,1 File and args: - nowin
SunOS Release 5.10 Version Generic_118833-33 64-bit
Copyright 1983-2006 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

nowin makes me think of 'no-win situation' but that's nothing to do with installing Solaris.

So it's booted into Solaris 10 from the net, and it's now a question of going through the installer screens again:
The machine is on a subnet
It's using the router downstairs
It's not using any naming services
It's in Great Britain
Set the time (tercel-1 had drifted five minutes)
Enter root password
Enable remote services
(I'm not sure quite what that means, since I can't ssh into the server at this point, which would be useful to free the serial cable so I could install multiple machines at once)
Standard network install with auto reboot (and auto eject of CDs though that's not relevant)
Accept the NFS location of the install images
Choose Initial system - overwrites everything on the boot disk
Select locale as Great Britain UTF-8
No Solaris 10 extra value software
No Web Start Ready software
Entire Distribution (there isn't a headless option)
Then go through lots of packages deselecting anything with a GUI
And when you get to 'Z' you realise there's still 'a' after it as it's ASCII
So give up and hit OK
And are warned that you've deselected some packages (such as X) that other software depends on
So reset and install everything. Maybe later I'll find out how to make an image with just headless Solaris for developers, so I don't have to go through hundreds of packages on each machine, but this is too painful to be worth the effort.
Select a disk (there's only one on this machine)
Continue, don't preserve data.
Use AutoLayout
Select all file systems for AutoLayout
(including /usr/openwin, which shouldn't be necessary on a headless system)
Accept the auto-layout
Don't mount any software from remote server.
Begin installation
OK to change default boot device
Watch the little bar go across the screen as it writes 3.8 GB of stuff to the drive.
Make a coffee and try to get the terminal server working - it requires a serial connection to tell it which IP to use, and since my only serial connection is in use, I can't set it up at this point.
Read a little on creating flash archives rather than doing this again for the next machine.
Skip additional languages and software.
It does seem to take quite a while to load the smf service descriptions, which weren't part of Solaris 9, so the first boot is much slower than under the previous OS.
I'm not sure how long subsequent boots take.
Add user 'pete'.
Login via ssh and check that can su and get control.
Disconnect LOM cable.

Repeat with next machine.

Set up DNS without specifying a domain:
Add all local machines to /etc/hosts
Create /etc/resolv.conf with the nameserver
Create /etc/defaultrouter with
Copy nsswitch.dns as the active configuration: cp /etc/nsswitch.dns /etc/nsswitch.conf
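The steps above, sketched as commands - run here against a scratch directory standing in for /etc, with placeholder 192.168.0.x addresses (the real addresses aren't in these notes):

```shell
# Scratch directory standing in for /etc, so this is a dry run;
# on the real box the files live in /etc itself.
ETC=$(mktemp -d)

# All local machines, with placeholder addresses
cat >> "$ETC/hosts" <<'EOF'
192.168.0.11    tercel-1
192.168.0.12    tercel-2
192.168.0.13    tercel-3
EOF

# Resolver pointed at the router (placeholder address)
echo 'nameserver 192.168.0.1' > "$ETC/resolv.conf"

# Solaris reads the default route from /etc/defaultrouter
echo '192.168.0.1' > "$ETC/defaultrouter"

# Solaris ships an nsswitch.dns template; fake a one-line stand-in here
echo 'hosts:      files dns' > "$ETC/nsswitch.dns"
cp "$ETC/nsswitch.dns" "$ETC/nsswitch.conf"
```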

I started stripping out all the packages related to not-present hardware (CD/DVD writer, USB, FireWire) and to the desktop, but ended up with a corrupted locale (even on a headless machine the locales require fonts, which is a bit of a surprise) and had to reset and install again.
Installing as an update would entail putting all the packages back in manually.
It would be really nice if there were a simple way to say "I don't want Gnome or X, thank you" rather than unchecking hundreds of packages and getting no feedback on dependencies.




OS Update frenzy

I'm not actually off having a life this weekend, and have run out of space on my laptop's Linux partition, so am in the process of an upgrade frenzy.

I've archived all the data off the laptop onto my Mac, and repartitioned the laptop from 45GB Win/45GB Ubuntu to 20GB Win/20GB Ubuntu/55GB Data partitions, and will re-install Ubuntu and an ext3 driver for Windows so I don't have half my data on one partition and half on the other. It's a real pain that the laptop doesn't come with media for XP, as there's a lot more crap on it that I could get rid of if I could do a clean install (not only in terms of disk space - installing and uninstalling some DVD software has left the audio system in mumbling mode), but I don't want to have to pay Microsoft to re-install an OS I already paid for when I bought the laptop. The two things I use Windows for are games (mainly Runescape these days; maybe the next version of Ubuntu's Java plugin will be more stable) and stepwise debugging of C++ code in Visual Studio - I haven't found anything like as good a debugger for Linux yet.

My iPod battery died in Romania, and it's still not holding any charge; I'm not sure whether to replace it or not - the iPod mainly gets used in my car on a 30-minute commute each day, which charges it a bit, and the battery doesn't seem to be designed for that use (nor the UI, for that matter - you can't change albums through either the iPod or the head unit when it's plugged in, otherwise it wouldn't be an issue).

Having thought more about nodes for the quad-store, I've reconnected the three Netras in the loft and am installing a clean version of Solaris 10 on them, so I can see how it works on true 64-bit hardware and server-class disks, and don't cook my groin running the CPU in the laptop.

The main reason I don't use the Netras much is that they are noisy and use lots of electricity, so I've also ordered an Excito Bubba, a fanless home server which draws 10W at full load, so it can be left on in the corner behind the telly and used for file sharing.

Depending on what happens with the laptop and Bubba combination, and whether it's worth fixing the iPod, I may get rid of the Mac, since the only things I use it for that are OS X specific are re-installing the iPod's software when it goes wrong (which it does a couple of times a month) and archiving my mail accounts (which the Bubba does automatically).


Labels: , , ,