(to rhyme with silly string)
Since I'm not seeing the fine lass this weekend, I've spent some time twiddling my bits - playing with applying SSE2 to string algorithms.
SSE2 shifts the costs of certain operations drastically, turning O(N*M) into O(N) for M <= 16, and making register comparisons far faster than loads. I've played with algorithms for searching for a single character in a string, and for searching for a substring. The platform is WinXP on an AMD 3400 mobile. Applying SSE to the naive algorithm for finding a character - loop and compare each byte - runs around 5.3 times faster, using _mm_cmpeq_epi8 to compare 16 bytes at a time.
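For the curious, the character search looks roughly like this - a minimal sketch, not my benchmarked code; the function name find_char_sse2 and the scalar tail loop are my own framing here:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Find the first occurrence of c in s (length n), comparing 16 bytes
   per iteration with _mm_cmpeq_epi8. _mm_movemask_epi8 packs the
   16 byte-wise compare results into a 16-bit mask; a set bit marks a
   match. Returns the index of the first match, or n if none. A plain
   scalar loop handles the sub-16-byte tail. */
static size_t find_char_sse2(const char *s, size_t n, char c)
{
    const __m128i needle = _mm_set1_epi8(c);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, needle));
        if (mask != 0) {
            int off = 0;            /* lowest set bit = first match */
            while (!(mask & 1)) { mask >>= 1; ++off; }
            return i + off;
        }
    }
    for (; i < n; ++i)              /* scalar tail */
        if (s[i] == c)
            return i;
    return n;
}
```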
Applying SSE to the naive algorithm for finding a substring - find the first character of the pattern, then test the remainder - gives a 2 to 5 times speed up (the best case is where the first character never occurs in the input). SSE is used both to match the first character (comparing a block of input against the first character 16 bytes in parallel) and to compare the remainder (comparing the current input to the pattern 16 bytes at a time).
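A sketch of that naive+SSE substring search, under my own naming (find_substr_sse2); for brevity I verify candidates with memcmp here, where the real version compares 16 bytes at a time with SSE as well:

```c
#include <emmintrin.h>
#include <stddef.h>
#include <string.h>

/* Scan for candidate positions of the pattern's first byte 16 at a
   time, then verify the remainder at each candidate. Returns the
   index of the first match, or n if none. */
static size_t find_substr_sse2(const char *s, size_t n,
                               const char *pat, size_t m)
{
    if (m == 0) return 0;
    if (m > n) return n;
    const __m128i first = _mm_set1_epi8(pat[0]);
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i block = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, first));
        while (mask != 0) {
            int off = 0, bits = mask;
            while (!(bits & 1)) { bits >>= 1; ++off; }
            size_t pos = i + off;
            if (pos + m <= n && memcmp(s + pos, pat, m) == 0)
                return pos;
            mask &= mask - 1;   /* clear lowest candidate, try next */
        }
    }
    for (; i + m <= n; ++i)     /* scalar tail */
        if (s[i] == pat[0] && memcmp(s + i, pat, m) == 0)
            return i;
    return n;
}
```

The best case falls out naturally: if pat[0] never appears, the inner verification never runs and the whole input is consumed 16 bytes per compare.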
Using Boyer-Moore on the same input usually gives results between the two. Attempting to use SSE for the inner part of Boyer-Moore doesn't help - extracting the index of the first mismatch from the 16-bit mask indicating which characters in the pattern match the input costs more than the non-SSE version.
Boyer-Moore-Horspool is faster than Boyer-Moore, and faster than naive+SSE for longer strings, but its simplifications don't seem to be enough to admit any obvious parallelisations. The speed advantage over Boyer-Moore is probably that the algorithms are implemented as functions, and Boyer-Moore has to allocate and free an array on each call. So the performance would be different for a pattern matcher object.
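For reference, Horspool's simplification is a single 256-entry skip table, which fits on the stack - no per-call allocation. A minimal version (my own sketch, with memcmp for the window check):

```c
#include <stddef.h>
#include <string.h>
#include <limits.h>

/* Boyer-Moore-Horspool: build a skip table from the pattern, then
   slide a window over the input, shifting by the skip for the last
   byte of the current window. Returns the index of the first match,
   or n if none. */
static size_t horspool(const char *s, size_t n, const char *pat, size_t m)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i, pos;
    if (m == 0) return 0;
    if (m > n) return n;
    for (i = 0; i <= UCHAR_MAX; ++i)
        skip[i] = m;            /* default: shift past the window */
    for (i = 0; i + 1 < m; ++i) /* last pattern byte keeps the default */
        skip[(unsigned char)pat[i]] = m - 1 - i;
    pos = 0;
    while (pos + m <= n) {
        if (memcmp(s + pos, pat, m) == 0)
            return pos;
        pos += skip[(unsigned char)s[pos + m - 1]];
    }
    return n;
}
```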
Knuth–Morris–Pratt seems to underperform on these tests, again probably due to the dynamic memory.
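The dynamic memory in question is KMP's failure table, which in a plain-function implementation gets malloc'd and freed on every call - a rough sketch (kmp_search is my naming, not the benchmarked code):

```c
#include <stddef.h>
#include <stdlib.h>

/* Knuth-Morris-Pratt: precompute, for each pattern prefix, the length
   of its longest proper prefix that is also a suffix, then scan the
   input once without ever backing up. Returns the index of the first
   match, or n if none (or if allocation fails). */
static size_t kmp_search(const char *s, size_t n, const char *pat, size_t m)
{
    size_t *fail, i, k;
    if (m == 0) return 0;
    if (m > n) return n;
    fail = malloc(m * sizeof *fail);    /* the per-call allocation */
    if (!fail) return n;
    fail[0] = 0;
    for (i = 1, k = 0; i < m; ++i) {    /* build failure table */
        while (k > 0 && pat[i] != pat[k])
            k = fail[k - 1];
        if (pat[i] == pat[k])
            ++k;
        fail[i] = k;
    }
    for (i = 0, k = 0; i < n; ++i) {    /* single pass over the input */
        while (k > 0 && s[i] != pat[k])
            k = fail[k - 1];
        if (s[i] == pat[k])
            ++k;
        if (k == m) {
            free(fail);
            return i + 1 - m;
        }
    }
    free(fail);
    return n;
}
```

Moving the malloc/free pair into a pattern object's constructor/destructor is exactly the change the next test harness should measure.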
I need to think about SSEifying Boyer-Moore-Horspool and Knuth–Morris–Pratt, and create another test harness that allows for pattern objects, so the initialization of the tables in the more complicated algorithms doesn't count against them. But in single-function use, it appears that naive+SSE is best when the first character is rare, and Boyer-Moore-Horspool when the string is longer. Rareness also helps the algorithms with dynamic memory: each is called repeatedly until all the input is scanned, and the rarer the pattern, the fewer times the function is called.
Before I do that, I'll get the core object system in kin working (see next post) - I'm just about happy to go on with that now.