Raspberry Pi 3 vs. DragonBoard: a reply to criticism / DataArt company blog / SurprizingFacts

Author: Nikolai Khabarov, Embedded Expert at DataArt, smart home technology evangelist.

The test results given in the article "Comparing the performance of Raspberry Pi3 and DragonBoard boards when working with Python applications" raised doubts among some colleagues.

In particular, the following comments appeared under the article:

"… I ran benchmarks between 32-bit ARMs, between 64-bit ones, and between Intel x86_64, and all the numbers were comparable. At least between 32-bit and 64-bit ARMs, the difference was tens of percent, not several-fold. Or perhaps you simply specified --cpu-max-prime."

"Surprising results usually mean a mistake in the experiment."

"There is a suspicion that there is some mistake in the CPU test. I personally tested different ARMs with sysbench, and the difference was nowhere near 25 times. In principle, a good ARM processor can be several times more efficient in the CPU test than the BCM2837, but not 25 times. I suspect that the test for the Pi was run in one thread, and for DragonBoard in 4 threads (4 cores)."

These comments concern the cpu test from the sysbench test suite. The answer to these assumptions turned out to be so long that I decided to publish it as a separate post, and at the same time to explain why the difference can be so enormous for some tasks.

Let's start with the fact that the commands, with all their arguments, were listed in the table in the original article. Of course, there is no --cpu-max-prime argument there, nor any other argument that would force multiple CPU cores to be used. As for the 10-20% difference, the commenter probably meant a test of overall system performance, which on real applications (not always, of course, but most likely) will indeed show a 10-20% difference between the 32-bit and 64-bit modes of the same processor.

In principle, you can read about how arithmetic on operands wider than the machine word is implemented, for example, here. There is no point in rewriting the algorithms. Say, multiplication will take roughly four times as many processor clock cycles (three multiplications plus additions). Naturally, this value can vary from processor to processor and with compiler optimization. For example, on an ordinary x86 processor there may be no difference at all, since with the advent of the MMX instruction set it became possible to use 64-bit registers and 64-bit calculations on a 32-bit processor, and SSE brought 128-bit registers. If the program is compiled using such instructions, it can run even faster than 32-bit calculations; a difference of 10-20% or even more can then be observed in the other direction, since the same MMX instruction set can perform several operations at once.

But we are still talking about a synthetic test that explicitly uses 64-bit numbers (the source code is available here), and since the package is taken from the official repository, there is no guarantee that every possible optimization is enabled in it (again, for the sake of compatibility with other ARM processors). For example, ARM processors starting with v6 support SIMD, which, like MMX/SSE on x86, can work with 64-bit and 128-bit arithmetic. We did not aim to squeeze as many benchmark points ("parrots") out of the tests as possible. We are interested in the real situation when applications are installed "out of the box", because we do not want to rebuild half of the operating system.

Still do not believe that, even out of the box, the speed on the same processor can differ by a factor of ten depending on the processor mode?

Well, let's take the same DragonBoard.

  sysbench --test=cpu run

This time with screenshots:

12.4910 seconds. Ok, now on the same board:

  sudo dpkg --add-architecture armhf
  sudo apt update
  sudo apt install sysbench:armhf

With these commands, we installed the 32-bit version of the sysbench package on the same DragonBoard.

And again:

  sysbench --test=cpu run

And here is the screenshot (the output of apt install is above):

156.4920 seconds. The difference is more than 10 times. Since we are on the subject, let's see why. Let's write a simple program in C:


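#include <inttypes.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* volatile forces real loads, stores and a real multiplication at run time */
    volatile uint64_t a = 0x123;
    volatile uint64_t b = 0x456;
    volatile uint64_t c = a * b;
    /* PRIu64 prints a uint64_t correctly on both 32-bit and 64-bit targets */
    printf("%" PRIu64 "\n", c);
    return 0;
}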

The volatile keyword is used so that the compiler does not compute everything in advance, but actually assigns the variables and performs an honest multiplication of two arbitrary 64-bit numbers. Let's compile the program for both architectures:

  arm-linux-gnueabihf-gcc -O2 -g main.c -o main-armhf
  aarch64-linux-gnu-gcc -O2 -g main.c -o main-arm64

And now let's look at the disassembly for arm64:

  $ aarch64-linux-gnu-objdump -d main-arm64  

The mul instruction is used, quite predictably. And now for armhf:

  $ arm-linux-gnueabihf-objdump -d main-armhf  

As you can see, the compiler used one of the long arithmetic methods. As a consequence, we see a whole pile of code that uses, among other things, the rather heavy mul, mla and umull instructions. Hence the several-fold difference in performance.

Yes, you could also try compiling with an extended instruction set enabled, but then we may lose compatibility with some processors. Once again, we were interested in the real speed of the entire board with real binary packages. We hope this explanation of why this particular cpu test produced such a difference is sufficient, and that you will not be puzzled by such gaps in some tests and, possibly, in some application programs.