Just the guy I'm looking to talk with!
I've charged that published benchmarks are not always valid: people quote them out of context, or what the benchmarks actually measure is not clear to end users, and it takes expert, objective help to interpret the results properly (hence my earlier crack that Smartbench doesn't walk on water(*)).
Can you share, or provide links on your site to, info that might help us understand what your benchmarks measure and how to interpret the results?
~~~
(*) Back in the day I wrote benchmarks for vector performance and compiler optimizations that were adopted by Cray Research; some of that old code eventually made its way elsewhere. From that viewpoint, I feel qualified to say that no benchmark walks on water, without it being personal to you, the author in this case.
I agree that no benchmark will exactly duplicate user experience, but as long as you know how to interpret the results a benchmark produces, I believe there is value in it (otherwise I would not have produced Smartbench in the first place).
What a coincidence: I also worked on compiler optimization, in my case for an earlier version of Sun's SuperSparc architecture, which at that time was mainly focused on multi-core and superscalar design. It involved modifying the GNU GCC compiler to produce more optimized binaries for that particular architecture.
To give you a general idea, Smartbench 2011 has 5 independent mini-test suites. The Pi, Mandelbrot and String tests focus on stressing the CPU, while Tunnel and Jellyfish focus on stressing the GPU. The CPU tests were rewritten for v2011 so that the workload is split equally between 4 threads, hence they will use up to 4 cores if they are available.
The Pi test is just that: it calculates Pi using 4 separate threads, then combines the results. Very tight loops of integer and floating-point calculations are used, so there is a good chance that this test runs entirely within the L1 cache.
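To make the 4-way split concrete, here is a rough sketch of the pattern in Java. It is not the actual Smartbench source; the Leibniz series and the term count are just illustrative stand-ins for whatever Pi algorithm the real test uses.

```java
// Sketch only: approximate Pi with the Leibniz series, split across 4 threads.
// Each thread sums every 4th term; partial sums are combined at the end.
public class PiSketch {
    static final int THREADS = 4;
    static final long TERMS = 400_000_000L;   // illustrative workload size

    public static void main(String[] args) throws InterruptedException {
        double[] partial = new double[THREADS];
        Thread[] workers = new Thread[THREADS];
        for (int t = 0; t < THREADS; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                double sum = 0.0;
                // Tight loop of integer and floating-point work.
                for (long k = id; k < TERMS; k += THREADS) {
                    sum += ((k & 1) == 0 ? 1.0 : -1.0) / (2 * k + 1);
                }
                partial[id] = sum;
            });
            workers[t].start();
        }
        double pi = 0.0;
        for (int t = 0; t < THREADS; t++) {
            workers[t].join();
            pi += partial[t];
        }
        System.out.println("pi ~= " + (4.0 * pi));
    }
}
```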
Mandelbrot obviously plots the famous Mandelbrot set, but I intentionally divided the screen into 4 squares so that each area can be updated independently by a separate thread. This one uses a much larger loop, but still focuses on integer and floating-point calculations.
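The quadrant split looks something like this (again a sketch, not the real code; the image size, complex-plane bounds, and iteration cap are made-up parameters):

```java
// Sketch only: render the Mandelbrot set with one thread per screen quarter.
public class MandelbrotSketch {
    static final int W = 512, H = 512, MAX_ITER = 256;   // illustrative values
    static final int[] pixels = new int[W * H];

    // Render the region rows [y0, y1) x cols [x0, x1).
    static void renderRegion(int x0, int x1, int y0, int y1) {
        for (int py = y0; py < y1; py++) {
            for (int px = x0; px < x1; px++) {
                double cr = -2.0 + 3.0 * px / W;   // map pixel to complex plane
                double ci = -1.5 + 3.0 * py / H;
                double zr = 0, zi = 0;
                int i = 0;
                while (i < MAX_ITER && zr * zr + zi * zi <= 4.0) {
                    double t = zr * zr - zi * zi + cr;
                    zi = 2 * zr * zi + ci;
                    zr = t;
                    i++;
                }
                pixels[py * W + px] = i;   // iteration count stands in for color
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        int n = 0;
        for (int qy = 0; qy < 2; qy++) {
            for (int qx = 0; qx < 2; qx++) {
                final int x0 = qx * W / 2, y0 = qy * H / 2;
                workers[n] = new Thread(() ->
                        renderRegion(x0, x0 + W / 2, y0, y0 + H / 2));
                workers[n++].start();
            }
        }
        for (Thread w : workers) w.join();
    }
}
```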
String does a series of tests on strings: assignments, comparisons, copying and regular-expression parsing. The strings used are quite large; my hope here is to completely fill the L1 cache so that the test exercises the CPU-to-memory interface more thoroughly.
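The operations are along these lines (sketch only; the actual string sizes and regex patterns in Smartbench differ):

```java
// Sketch only: the four kinds of string work described above, on a string
// large enough to overflow a typical 32-64 KB L1 data cache.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StringSketch {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) sb.append("word").append(i).append(' ');
        String big = sb.toString();                    // nearly 1 million characters

        String alias = big;                            // assignment
        String copy = new String(big.toCharArray());   // copying
        boolean same = big.equals(copy);               // comparison

        // Regular-expression parsing: count tokens like "word12345".
        Pattern p = Pattern.compile("word(\\d+)");
        Matcher m = p.matcher(big);
        int count = 0;
        while (m.find()) count++;

        System.out.println(same + " " + (alias == big) + " " + count);
    }
}
```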
The Tunnel test looks simple, but I've intentionally increased the load by splitting each wall into many smaller triangles. This was done to make sure no device runs up against the 60fps limit; Tegra 2 processors only manage around 30fps in this test. It uses various OpenGL calls that are common in today's games.
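The subdivision trick itself is simple; conceptually it looks like this (sketch only; the real wall geometry and grid density are not what I show here):

```java
// Sketch only: tessellate a wall that could be drawn with 2 triangles into
// an n x n grid (2*n*n triangles) so the GPU has far more vertices to push.
public class WallTessellation {
    // Returns interleaved (x, y, z) vertices for a unit wall at z = 0.
    static float[] tessellateWall(int n) {
        float[] verts = new float[n * n * 2 * 3 * 3];
        int i = 0;
        float step = 1.0f / n;
        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                float x0 = col * step, y0 = row * step;
                float x1 = x0 + step, y1 = y0 + step;
                // Two triangles per grid cell.
                i = put(verts, i, x0, y0); i = put(verts, i, x1, y0); i = put(verts, i, x1, y1);
                i = put(verts, i, x0, y0); i = put(verts, i, x1, y1); i = put(verts, i, x0, y1);
            }
        }
        return verts;
    }

    static int put(float[] v, int i, float x, float y) {
        v[i++] = x; v[i++] = y; v[i++] = 0f;
        return i;
    }

    public static void main(String[] args) {
        float[] wall = tessellateWall(32);               // 2048 triangles
        System.out.println(wall.length / 9 + " triangles per wall");
    }
}
```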
Jellyfish again uses many semi-transparent cubes with textures on them. From what I've seen, some devices do better on the Tunnel test, while others do better on the Jellyfish test.
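For the curious, semi-transparent textured geometry on that era's OpenGL ES 1.x fixed-function pipeline needs state roughly like the following (a sketch of the general technique, not my actual rendering code):

```java
// Sketch only: typical fixed-function GL state for drawing semi-transparent
// textured geometry on OpenGL ES 1.x.
import javax.microedition.khronos.opengles.GL10;

public class TransparencySetup {
    static void setup(GL10 gl, int textureId) {
        gl.glEnable(GL10.GL_TEXTURE_2D);
        gl.glBindTexture(GL10.GL_TEXTURE_2D, textureId);
        // Standard alpha blending: src*alpha + dst*(1 - alpha).
        gl.glEnable(GL10.GL_BLEND);
        gl.glBlendFunc(GL10.GL_SRC_ALPHA, GL10.GL_ONE_MINUS_SRC_ALPHA);
        // Transparent objects are usually drawn without depth writes so
        // cubes behind them still show through.
        gl.glDepthMask(false);
    }
}
```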
As you can see, what I want to do here is create more mini-tests that mimic common tasks and operations that phones need to execute. More mini-tests will be developed over time; you will see some new ones in v2012. This is one of the reasons I used the year as the version number: once more mini-tests are added, I need to create a new baseline for the test results, hence all previously captured results are no longer comparable to the new ones.
The Productivity Index is simply the average of the 3 CPU benchmark results, normalized so that an HTC G2 scores 1000. The Games Index is likewise the average of the 2 GPU benchmark results, normalized the same way. If you get 2000, that means you completed the tests twice as fast as the 800MHz G2.
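In other words, the scoring math is essentially this (the baseline timings below are made up for illustration; only the formula reflects what I described):

```java
// Sketch only: each sub-test is scored as 1000 * (G2 baseline time / device
// time), so finishing in half the G2's time yields 2000; the index is the
// average of the sub-scores.
public class IndexSketch {
    static double subScore(double g2TimeMs, double deviceTimeMs) {
        return 1000.0 * g2TimeMs / deviceTimeMs;
    }

    public static void main(String[] args) {
        double[] g2Times = {4200, 6100, 3800};      // invented G2 baselines
        double[] devTimes = {2100, 3050, 1900};     // exactly twice as fast
        double sum = 0;
        for (int i = 0; i < g2Times.length; i++) {
            sum += subScore(g2Times[i], devTimes[i]);
        }
        System.out.println("Productivity Index = " + sum / g2Times.length); // 2000.0
    }
}
```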
Not sure if this answered any of your questions. If you have any more, I will gladly respond.
My apologies to everyone; this post is sorta off track.