2006-06-11

How fast is Java?

Link: http://blogs.sun.com/roller/page/dagastine?entry=java_is_faster_than_c

In response to David Dagastine's use of the SciMark numeric benchmark, where he finds that there's very little difference between in speed between Sun's JVM and native C compiled with Visual C++ 6, using a numeric benchmark that uses static methods to manipulate arrays of primitive data types, primarily doubles.

This agrees with my experience for numerical code. When I was lead software engineer in the technical computing group of the world's largest aerospace company, I moved several numerical projects from Fortran or C++ to Java. At the time, one the major reason were bugs in VC's optimisation code which meant the results were wrong, and discrepancies between release and debug builds that made C++ bugs harder to track, whereas Java had performance within 10% and much stronger guarantees of its results.

But MS got round to fixing the bugs, and in VS 2005 appear to have made some major gains in performance.

On my laptop (WinXP SP2, AMD Athlon 3400+), comparing Java 1.6.0-beta with VS 2005 express gives 388.67 and 631.12 mflops respectively, which is a much bigger difference than I observed between Java 1.4 and VS6. The SciMark code is portable C, so they don't use C++ intrinsics, which can give an order of magnitude speed up for certain code (though since it's working in double precision you probably wouldn't get that much improvement for quite a lot less readability, as you can only process 2 doubles at a time with SSE2, but can process 4 floats, and so keep whole vectors in single registers).

So I'd disagree that Java 1.6 is faster than C/C++ in this case on the two counts - firstly it's significantly slower than the C code it was benchmarked against, and secondly SciMark is not using current C++ techniques for optimising numeric code, which any serious numerical library would.

The SciMark code is optimised to a similar degree on both C and Java platforms, but the nature of the Java language means that there are further optimisations which are impossible (due to not having access to intrinsics or inline assembler), and many more that go against Java's intent of being a safe object oriented programming language (such as passing around char arrays rather than String objects).

I've observed before, that without any kind of meta-programming facility, writing optimised numerical Java is very painful (though I haven't had cause to try again - even with better function inlining, you have to copy all the parameters you need, and construct and object for any result which is more complicated than a single type, so it doesn't impact on the complexity much). For small projects, you can do it be hand, for others you have to either use a code generator from a higher level model, or just put up with it running slow. It would be nice if having to hard code optimisations disappear from Java code as Sun improves the JVM's capabilities. For example, copying all state out of an object into local variables, then performing a tight loop, then copying back and gives a measurable performance boost. The same applies for nested loops on multidimensional arrays - you hoist the reference to the row. Currently you have to do such tedious code by hand, though Fortress seems to favour such optimisations in its transactions, so may be a better way ahead for numeric code.

For something a little closer to Java's enterprise use, say date formatting functions, the difference between Java6 and C++ in small benchmarks is closer to a factor of 7 slower, for example see this thread.

However, what I learnt in that thread isn't that if you port C to Java it runs seven times slower (usually it's not as bad as that, and as the SciMark above shows - a good programmer can write Fortran in any language and it'll run quickly), but rather that even good Java programmers won't think of solving the problem in the faster way, but use Java idioms instead.

The idiomatic Java way of solving that problem was twenty times slower than the C code. If you're used to thinking that StringBuilder is faster than StringBuffer is faster than concatenating String, you won't write the sort of code that works fast, however good the VM gets. The granularity at which safety is assured in Java - immutable objects - means you can't optimise across method calls. You need to wrap up a string as an immutable object to protect it from modification, rather than proving its const via the type system.

Java is designed to be a safe, object oriented language. Fast code either operates above this level - for example, using a DSL specify that you want to multiple a matrix row by factor and it'll generate the code inline for it which works on the primitive types - or below this level, such as C can do, and the generated code would. Having to work only with objects with a statically typed interface means you don't have the flexibility to get a high level view of the logic, and get to feel the bits between your toes if you want to.

The granularity problem in Java also applies to array bounds and other checks - much of what can be known at compile time is lost by the Java type system. The attempt to fix this with generics didn't get far in terms of capability - you can only specify the generic type as far as the current Java type system allows, and the information is erased at runtime, so the JVM can't selectively remove redundant checks (other than checkcast). Traditionally, this hit you in loops of a known number of iterations against an array of known length – the array access instruction would still check the bounds. That one may have been mitigated, but only by means of specifically getting the JVM to look for that particular pattern. Traits allow you to provide a general mechanism. For example, a method which can only take a square matrix only needs to check its arguments if they are not known to be square at compile time, which depends where the method's being called from, so requires a backwards possible-caller analysis, and different code to be generated based on the results. Having an NxM matrix with a trait for squareness, also square matrix subclasses with the trait fixed, and knowing this trait doesn't change after the object is created would allow the check to be done both at compile and run time as appropriate. I'm not aware of any language yet that has that level of sophistication of traits inference. (I also don't like the compiler's lack of escape analysis for generics - if you've only ever put a Foo into a list, then you shouldn't have to tell it that it's a list of Foo. But that's for another time, as I'm not doing much Java programming in the real world now, and getting on to how traits can improve dynamic language performance is another post.)

So I don't think the JVM will ever be the fastest thing. Java's too much in the middle ground:

  • The language lacks the meta-programming facilities that hide the tedium of hand optimisations

  • The language lacks access to low level instructions that allow optimisations

  • There is no concept of a cross-object inference to prove unsafe operations cannot happen, rather it relies on information hiding to prevent them, which forces conservative restrictions and redundancy

  • Java idioms use object level safety and synchronised blocks, rather than traits and transactions. This prevents many possible optimisations based on direct access, thread-safe caching, and often requires redundant checks on information already known.


There were good reasons for these decisions in the evolution of Java, but the effects of them won't ever go away, and as long as it's the Java virtual machine, the JVM will have some compromise on its performance.

But it never was intended to compete with Fortran, and is fast enough for many uses.


TME

Labels: , ,

2 Comments:

Blogger Pete Kirkham said...

The command line is the one from the SciMark2 README:

java jnt.scimark2.commandline

The version of Java being run is:
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-beta-b59g)
Java HotSpot(TM) Client VM (build 1.6.0-beta-b59g, mixed mode, sharing)

I couldn't find any option to install the server VM for the i586 JRE for Windows, and the AMD64 installer doesn't recognise my processor type (which may be its way of telling me I'm not running 64 bit Windows).

So if you install the available JRE follow the instructions for the benchmarks out the tin on the hardware I have, you get what I do.

But like I said, a factor of two when running static functions on primitive arrays isn't anything compared with the costs incurred with Java as Java programmers write it.

The same happens in the transducer nets (written in C++) that I'm working on in my day job, so asking 'how can we ensure safety if the compiler is allowed to break encapsulation' is more interesting than 'my runtime is faster' questions.


TME

Wednesday 26 July 2006 at 11:36:00 BST  
Blogger Pete Kirkham said...

Using the -server VM from the JDK on my (non-'server class', but still a few times faster than the Netras in the attic) laptop, gives these results:

bash-2.04$ /c/Program\ Files/Java/jdk1.5.0_07/bin/java -server -cp build jnt.scimark2.commandline

SciMark 2.0a

Composite Score: 463.65839147640264
FFT (1024): 390.38510862050515
SOR (100x100): 712.8949164074697
Monte Carlo : 103.48320953698396
Sparse matmult (N=1000, nz=5000): 267.16672713213075
LU (100x100): 844.3619956849234

java.vendor: Sun Microsystems Inc.
java.version: 1.5.0_07
os.arch: x86
os.name: Windows XP
os.version: 5.1
bash-2.04$ /c/Program\ Files/Java/jdk1.6.0/bin/java -server -cp build jnt.scimark2.commandline

SciMark 2.0a

Composite Score: 483.79811594142984
FFT (1024): 439.5767813776436
SOR (100x100): 696.6576230697497
Monte Carlo : 120.15911365091783
Sparse matmult (N=1000, nz=5000): 301.7311329273631
LU (100x100): 860.865928681475

java.vendor: Sun Microsystems Inc.
java.version: 1.6.0-rc
os.arch: x86
os.name: Windows XP
os.version: 5.1


That's a composite score of 483.80, which is about 100 better than the client, but still 150 behind the latest free MSVC++.

I also messed around porting the C code to C++ - using template collections for arrays etc - and got 566.35. MSVC++ doesn't appear to be hoisting the offset from the reference to the matrix object to the pointer to the memory holding the matrix's data out of any of the loops (it inlines operator[], but either doesn't detect that the offset is an invariant, or doesn't hoist outside the function's scope). So there's a cost of making the code even a little bit more maintainable and safer than C.


TME

Wednesday 26 July 2006 at 15:11:00 BST  

Post a Comment

<< Home