The Dhrystone benchmark, the LPC2106 and GNU GCC

The KEIL web site has listed benchmarks for various compilers, naturally with the goal of showing its own compiler as the better choice. The difference with the GNU GCC compiler was especially striking: the code produced by GCC was five to six times slower than that produced by the commercial compilers. The KEIL compiler (then in beta) produced the smallest code, which was still very near top performance. These tables also appear in the book "The Insider's Guide to the Philips ARM7-Based Microcontrollers" by Trevor Martin.

Before going on, allow me to cite the influential book "Performance and Evaluation of Lisp Systems":

"Benchmarking without analysis is as useless as analysis without benchmarking."
Richard P. Gabriel, 1985

What I was missing in the KEIL benchmark comparison page was exactly that: the analysis. Especially with such a large discrepancy, an analysis should have been provided. The aforementioned book "The Insider's Guide to the Philips ARM7-Based Microcontrollers" draws the conclusion that one should steer clear of GCC for any serious embedded development, based on the data in the tables alone! (In Trevor Martin's defense, the benchmarks, and his conclusion, are just a side remark in his book.)

I wrote this page to demonstrate how true it is that benchmarking without analysis is bogus. Below are the results of my benchmark, which should match those by KEIL more or less (but do not). If you compare these results with those on the archived benchmarks page by KEIL, you will see that the results are very, very different. (I never copied the tables from the KEIL page here, because I did not want to replicate its controversial data on our site. Since KEIL has removed the pages, the link now goes via web.archive.org.)

Dhrystone 2.1 - the results

Compiler + flags	Image size	Dhrystones / second
GCC, ARM mode, non-optimized	8092	18,558
GCC, ARM mode, optimized (-O2)	4536	51,488
GCC, Thumb mode, non-optimized	6428	15,225
GCC, Thumb mode, optimized (-O2)	3660	44,239

This test was performed on an LPC2106 microcontroller running at 58.9824 MHz (a common 14.7456 MHz crystal multiplied by 4). The number of "Dhrystones per second" is the number of iterations of the benchmark divided by the time spent in the benchmark (in my case, the number of iterations was 500,000). As is apparent from its figures, KEIL uses the same definition. My controller should be close enough to the test setup of KEIL (µVision simulator that emulates an NXP/Philips LPC2294 running at 60 MHz in Thumb mode). Still, my benchmarks run significantly faster than what KEIL documents, and the sizes of my programs are smaller than any of the benchmark programs from KEIL. It is obvious that, despite the similarity in the specifications of our benchmarks, KEIL and I have tested entirely different systems.
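The arithmetic behind these figures can be sketched as follows, taking Dhrystones per second as iterations divided by elapsed time. The function name is hypothetical, and the run time is back-calculated from the fastest reported score for illustration only; it is not a measured value.

```python
CRYSTAL_MHZ = 14.7456      # crystal frequency from the text
PLL_MULTIPLIER = 4         # clock multiplier from the text
ITERATIONS = 500_000       # benchmark iterations from the text


def dhrystones_per_second(iterations: int, elapsed_seconds: float) -> float:
    """Return the benchmark score: iterations completed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return iterations / elapsed_seconds


core_clock_mhz = CRYSTAL_MHZ * PLL_MULTIPLIER

# Illustrative only: the run time implied by the fastest reported score.
implied_seconds = ITERATIONS / 51_488

print(f"core clock: {core_clock_mhz:.4f} MHz")
print(f"implied run time: {implied_seconds:.2f} s")
print(f"score: {dhrystones_per_second(ITERATIONS, implied_seconds):.0f} Dhrystones/s")
```

A run of several seconds is long enough that timer resolution should have little effect on the score, which is presumably why a large iteration count is used.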

LPC2106 versus simulated LPC2294
It is a rather bold statement to say that testing on an LPC2106 at 58.9824 MHz is close to testing a simulated LPC2294. Several individuals have criticized me for this remark, and rightly so: I was wrong. See section "Observations" for details.

For the record, I used the K&R C version of Dhrystone 2.1, with the two parts compiled as separate modules (not merged), no inlining, and a custom (unoptimized) function library for driving an LCD, accurate timing and a few other things. The GNU GCC tool-chain was version 3.4.3, obtained in binary form from www.gnuarm.org. The Thumb code was also "interworking" (allowing a mix of ARM and Thumb code) because the boot code would only build in ARM mode. In all tests, the benchmark ran from Flash ROM.

Frame of reference

Before any further analysis or reasoning about the results, one should first check whether the values that were measured are "plausible". For instance, if my benchmark results were above the documented performance of the microcontroller, this should raise suspicion.

Philips documents the LPC2100 chips at 54 "Dhrystone" MIPS at 60 MHz, where 1 MIPS is 1757 Dhrystones per second. According to Philips, the chip can therefore run at 94,878 Dhrystones per second. It is not documented whether this number is measured or theoretical; the Dhrystone benchmark documents the number of assignments, comparisons, jumps and calls that it makes and therefore you could calculate the Dhrystone result with the processor's instruction set (with cycle counts) in hand. If the performance is measured, it probably relates to code execution from SRAM.

Dhrystones and MIPS
The VAX 11/780 was declared a 1 MIPS machine and it ran 1757 Dhrystones per second. Through this relation, Dhrystones are often converted to MIPS. Usually, the Dhrystone version that the MIPS figure refers to is 1.0 or 1.1. Versions 1.1 and 2.1 of the Dhrystone benchmark are similar, but not the same.

For the fastest benchmark, I measured 51,488 Dhrystones per second. This is well below the documented upper bound of 94,878 and therefore, at first sight, my measured results appear plausible.
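The conversion between "Dhrystone MIPS" and Dhrystones per second, and the plausibility check above, can be sketched as follows. The function names are hypothetical; the constants come from the text.

```python
VAX_DHRYSTONES_PER_SECOND = 1757   # 1 "Dhrystone MIPS", per the text


def dmips_to_dhrystones(dmips: float) -> float:
    """Convert Dhrystone MIPS to Dhrystones per second."""
    return dmips * VAX_DHRYSTONES_PER_SECOND


def dhrystones_to_dmips(dhrystones_per_second: float) -> float:
    """Convert Dhrystones per second to Dhrystone MIPS."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


documented_bound = dmips_to_dhrystones(54)   # Philips figure at 60 MHz
measured = 51_488                            # fastest result (ARM mode, -O2)

print(f"documented upper bound: {documented_bound:,.0f} Dhrystones/s")
print(f"measured: {dhrystones_to_dmips(measured):.1f} Dhrystone MIPS")
print(f"within documented bound: {measured <= documented_bound}")
```

The comparison is only a sanity check: the documented figure is for 60 MHz and the measurement was taken at 58.9824 MHz, and the two figures may refer to different Dhrystone versions.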

Rowley Associates did their own benchmark tests and achieved 62,500 Dhrystones per second on an LPC2129 at 60 MHz, see their benchmark page. Rowley Associates achieves this by using an optimized string library.

Observations

In my comparison of an LPC2106 with a simulator (for the LPC2294), I made the following assumptions:

All of the microcontrollers from the LPC2000 series use the same ARM7TDMI core and therefore the benchmarks will have the same effect on the LPC2294 and the LPC2106.
The µVision simulator accurately simulates the LPC2294 plus all its system peripherals such as the on-chip Flash ROM and the Memory Access Module.

The first rule in high performance computing is: assume nothing [Michael Abrash, author of "Zen of Assembly Language"], followed by the second rule: verify everything. Relying on unverified assumptions was the trap I stepped into in the earlier version of this paper.

To verify my first assumption, I also ran the benchmark on an LPC2138 (I do not have an LPC2294). Without optimization, this produced 17,441 Dhrystones per second, to be compared with 18,558 Dhrystones per second on the LPC2106. I have no explanation for this deviation of roughly 6%. Perhaps there is a relation with the Flash ROM sector size of 4 KiB versus 8 KiB for the LPC2138 and the LPC2106 respectively (only the first 8 and the last 5 sectors of the LPC2138 are 4 KiB, the others are 32 KiB).

To verify my second assumption, I ran another benchmark that comes with the KEIL µVision evaluation package (Dhrystone 1.1) on both their simulator (choosing the LPC2106 device) and on an LPC2106 from Flash-ROM. The simulator produced 34,362 Dhrystones / second and the actual LPC2106 resulted in 30,960 Dhrystones / second. (These last two numbers relate to Dhrystone 1.1, all other numbers on this page are for Dhrystone 2.1. They cannot be compared with the values on the KEIL page; their only purpose is to show the difference between a simulated LPC2106 and the real LPC2106.)

Both my assumptions were wrong! However, the differences between my measurements and those by KEIL remain significant even after correcting for these deviations of approximately 15% in total. Note that I ran the Dhrystone benchmark from Flash ROM on the LPC2106, which is always somewhat slower than running it from RAM. The µVision simulator probably does not take the occasional wait state of the Flash ROM into account.
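The size of the two deviations can be checked with a short calculation. This is a minimal sketch; the helper name is hypothetical, and the exact percentage depends on which value is taken as the baseline.

```python
def relative_difference(value: float, baseline: float) -> float:
    """Return (value - baseline) / baseline as a fraction."""
    return (value - baseline) / baseline


# Assumption 1: LPC2138 versus LPC2106 (Dhrystone 2.1, non-optimized).
chip_gap = relative_difference(17_441, baseline=18_558)

# Assumption 2: simulated versus real LPC2106 (Dhrystone 1.1).
simulator_gap = relative_difference(34_362, baseline=30_960)

print(f"LPC2138 vs LPC2106: {chip_gap:+.1%}")
print(f"simulator vs real LPC2106: {simulator_gap:+.1%}")
```

Neither gap comes anywhere near the factor of five to six that separates the KEIL figures for GCC from those for the commercial compilers.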

It is common for embedded GNU GCC use to build with the linker options "-nostartfiles" and "-nostdlib", and then to explicitly add the required libraries. This is exactly what I did. It probably accounts for the small size of the executables that I built. If you think this is cheating, I feel otherwise: all of the example code that I have stumbled upon in my learning path to the LPC2100 series has these settings. It is well documented too; if you get your compiler from the www.gnuarm.org site (which almost everyone points you to), installing "newlib" is the default option.

Out of habit, I linked with my own string function library for the benchmarks. My library adds bounds checking; a few #define directives map the standard string functions to my safer versions. When I ran the benchmark again with the string functions coming from the newlib library, I got a speed-up of 14%. This leads to two conclusions: my string functions are not very optimized (this I knew) and the Dhrystone benchmark exercises the string functions heavily (this is actually well known, but I was still surprised by the amount). Rowley Associates achieves better benchmark results than the ones on this page, most likely by using their own optimized C library.

The relative speed of Thumb mode was an unexpectedly favourable result. The ARM manual explains that even though Thumb mode typically requires more instructions than ARM mode to do the same thing, it also requires less memory. When the memory throughput is the bottleneck, ARM code is hit harder than Thumb code. The Thumb instruction set is also a better fit for the code generators of C/C++ compilers. For example, the Thumb instruction set provides the common stack management instructions "PUSH" and "POP", which have to be simulated in ARM mode.
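To put numbers on "relative speed", the sketch below computes the Thumb-to-ARM ratios for image size and speed from the results table. The dictionary layout and function name are illustrative assumptions; the values are those from the table.

```python
# (image size, Dhrystones per second) taken from the results table
RESULTS = {
    ("ARM", "non-optimized"): (8092, 18_558),
    ("ARM", "-O2"): (4536, 51_488),
    ("Thumb", "non-optimized"): (6428, 15_225),
    ("Thumb", "-O2"): (3660, 44_239),
}


def thumb_relative_to_arm(optimization: str) -> tuple[float, float]:
    """Return (size ratio, speed ratio) of Thumb relative to ARM."""
    arm_size, arm_speed = RESULTS[("ARM", optimization)]
    thumb_size, thumb_speed = RESULTS[("Thumb", optimization)]
    return thumb_size / arm_size, thumb_speed / arm_speed


for optimization in ("non-optimized", "-O2"):
    size_ratio, speed_ratio = thumb_relative_to_arm(optimization)
    print(f"{optimization}: Thumb is {size_ratio:.0%} of the ARM size "
          f"and runs at {speed_ratio:.0%} of the ARM speed")
```

In both configurations Thumb gives up less than a fifth of the speed while saving about a fifth of the image size.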

I have used the optimization option "-O2" rather than "-O3" because I appeared to get better results with it. In addition, "-O3" enables inlining and this goes against the instructions for the Dhrystone benchmark. I have made no analysis of what the optimization levels do, but as you can see, they are quite effective for a simple benchmark program. When compared to production code, benchmarking programs are often quite atypical: they contain lots of short loops and long sequences of arithmetic. Compiler optimizations are regularly tested against such benchmarking programs. In my experience, the speed-up factor that these general optimization flags bring to production code is far more subtle. In fact, I optimize my programs for size by default and then hand-optimize the critical inner loops if needed.

The startup code that I use with GNU GCC initializes the Memory Access Module (MAM) of the LPC2100 series, along with the CPU and peripheral clocks. The Memory Access Module loads Flash memory into an internal bank of 128 bits at a time and pre-fetches the sequentially next bank while the CPU executes the first bank. This is far from a true LRU cache, of course, but it is a simple method to significantly speed up code that executes linearly (most of the time). To give an impression of the effectiveness of the Memory Access Module, when built without using the MAM, the Dhrystone benchmark for Thumb mode without optimizations ran at 3,214 Dhrystones / second (compare to 15,225 Dhrystones per second for the version with MAM).
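As a rough illustration of these numbers, the sketch below computes how many instructions fit in one 128-bit Flash line and the speed-up that the MAM gave in the Thumb, non-optimized case. The constant names are illustrative, and the instruction widths are the nominal 32 bits (ARM) and 16 bits (Thumb).

```python
FLASH_LINE_BITS = 128          # width of one MAM fetch, per the text
ARM_INSTRUCTION_BITS = 32      # nominal ARM instruction width
THUMB_INSTRUCTION_BITS = 16    # nominal Thumb instruction width

arm_per_line = FLASH_LINE_BITS // ARM_INSTRUCTION_BITS
thumb_per_line = FLASH_LINE_BITS // THUMB_INSTRUCTION_BITS

# Thumb mode, non-optimized: with and without the MAM (from the text).
with_mam = 15_225
without_mam = 3_214
mam_speedup = with_mam / without_mam

print(f"instructions per Flash line: ARM {arm_per_line}, Thumb {thumb_per_line}")
print(f"speed-up from enabling the MAM: {mam_speedup:.2f}x")
```

One Flash fetch therefore supplies twice as many Thumb instructions as ARM instructions, which is consistent with ARM code being hit harder when memory throughput is the bottleneck.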

Missing analysis...

If you were waiting for an analysis of why my numbers are different from those by KEIL, I have to disappoint you. I can only speculate as to how KEIL arrived at its results:

It is likely that KEIL used an inappropriate standard C library, judging by the size of the executable they arrive at.
Some notes from KEIL for other benchmarks suggest that they ran the Dhrystone benchmark with default settings for the compilers that they tested. For the KEIL compiler, this would mean full optimization (bottom of the page) and for the GNU GCC compiler no optimization at all (again, bottom of the page).
Code generation for Thumb mode has reportedly been improved in GNU GCC 3.4.3; KEIL tested with GNU GCC 3.2.2. Perhaps GCC just got significantly better since KEIL published its benchmark results.
The boot.s file that comes with GNU GCC does not initialize the "Memory Access Module" in the LPC2100 series, probably because the maintainers want to keep this tool-chain general purpose. Judging from the effectiveness of the Memory Access Module, the hypothesis that the GNU GCC benchmark ran with the MAM disabled (whereas it was enabled for the other compilers) would explain the difference. It would also mean that what was really benchmarked is the Memory Access Module, and not the compiler.

As I said, all of this is just speculation. Truth is, I have absolutely no idea how KEIL got its results. One final remark: although I can understand that KEIL uses a simulator to run the benchmarks on (because it makes it easy to obtain timings with CPU-cycle precision), I regard a benchmark on a simulator as non-authoritative; the results should have been validated on the real device.

Preliminary conclusion

To measure is to understand [James Harrington]. But it is also the other way around: to interpret the measurements (by yourself or by someone else), you must understand what was measured and how the test environment may have influenced the measurements. The difference between my results and those by KEIL, plus the hypotheses that I proposed, should be sufficient evidence that measurements must be analysed, and hypotheses must be postulated and verified. Without doing so, you have not gotten much further than not measuring anything at all.

The numbers on this page should not be taken at face value. I did not write this page to show you reliable benchmark results; I wrote it to show how the same benchmark running under essentially the same documented conditions, conducted by two persons/bodies can have results that vary by a factor of five. The key word in the preceding sentence is "documented", because it is obvious that our respective test conditions have been different.

In the paper "Benchmarking in context: Dhrystone", Richard York puts forward the argument that the Dhrystone benchmark is actually not very appropriate for predicting the performance of embedded system software; I recommend that you read this paper. The weight that the benchmark puts on the string functions was a surprise for me. The Dhrystone benchmark tests the C library as much as it tests the C compiler, which may not be fair when many embedded software engineers use custom libraries (e.g. for use with an RTOS).

In closing:

Through measurement to knowledge
H. Kamerlingh Onnes, 1882
