ARM-64 Architecture

Performance Benchmark

hardware

programming

Author

Hongyang Zhou

Published

September 10, 2020

There seems to be an ongoing trend of adopting ARM chips in commercial and even high-performance computing. I got a chance to try out the new Raspberry Pi 4 with a 64-bit quad-core processor running at 1.5 GHz and 8 GB of memory. The base frequency is 0.6 GHz, and it rises to 1.5 GHz under load. The baseline temperature at 0.6 GHz is 41 degrees Celsius. Earlier this year they launched a true 64-bit Linux distribution for ARM. Now it’s time to get a first-hand feeling for it.

Julia has an official distribution for 64-bit ARM systems. Running my unit test for the Batsrus.jl package takes 33.2 seconds on the Pi, while on my 2016 MacBook Pro it takes around 9.5 seconds. So the Pi is about 3.5x slower on a single core. It is possible to overclock the Pi to about 2.05 GHz. At this clock rate, the test finishes in 27.4 seconds, with the highest temperature reaching 47.7 degrees Celsius.
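The ratios quoted above follow directly from the timings. Here is a small Python sketch of the arithmetic, using only the numbers reported in this post (the variable names are mine, chosen for illustration):

```python
# Timings of the Batsrus.jl unit test in seconds, taken from the text above.
t_pi_stock = 33.2      # Raspberry Pi 4 at 1.5 GHz
t_pi_overclock = 27.4  # Raspberry Pi 4 overclocked to 2.05 GHz
t_mac = 9.5            # 2016 MacBook Pro

slowdown_vs_mac = t_pi_stock / t_mac            # how much slower the Pi is
overclock_speedup = t_pi_stock / t_pi_overclock # gain from overclocking
clock_ratio = 2.05 / 1.5                        # ratio of the two clock rates

print(f"Pi vs Mac slowdown: {slowdown_vs_mac:.2f}x")
print(f"Overclock speedup:  {overclock_speedup:.2f}x")
print(f"Clock ratio:        {clock_ratio:.2f}x")
```

The overclock speedup (about 1.21x) is smaller than the clock ratio (about 1.37x), so the test time does not scale linearly with core frequency.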

I also tried to test my plotting package in Julia, but could not get it to install.

As for multi-core performance, I tried my version of BATSRUS. The first time I compiled it, the linker complained:

FaceGradient.f90:(.text+0x3bb4): relocation truncated to fit: R_AARCH64_TLSLE_ADD_TPREL_HI12 against symbol `__facegradient_MOD_dcoorddxyz_ddfd' defined in .tbss section in Tmp_/FaceGradient.o

Searching the web led me to reports of this error in connection with 32-bit versus 64-bit systems. Following the suggestion there, adding the -fPIC flag solved the problem.

Then for the shocktube test, compiled with gfortran -O3, the results are shown in the following table:

Platform	# of MPI processes	# of threads	Time [s]
Pi ARM 1.5 GHz	1	1	2.97
Pi ARM 1.5 GHz	1	2	1.85
Pi ARM 1.5 GHz	1	4	1.55
Pi ARM 1.5 GHz	2	1	2.12
Pi ARM 1.5 GHz	4	1	2.27
Pi ARM 2.05 GHz	1	1	2.20
Pi ARM 2.05 GHz	1	2	2.40
Pi ARM 2.05 GHz	1	4	2.82
Pi ARM 2.05 GHz	2	1	1.75
Pi ARM 2.05 GHz	4	1	2.06
Mac i7 2.2 GHz	1	1	1.18
Mac i7 2.2 GHz	1	2	0.85
Mac i7 2.2 GHz	1	4	0.67
Mac i7 2.2 GHz	2	1	0.82
Mac i7 2.2 GHz	4	1	0.68
x86 Cascade Lake 2.7 GHz	1	1	1.19
x86 Cascade Lake 2.7 GHz	1	2	0.88
x86 Cascade Lake 2.7 GHz	1	4	0.72
x86 Cascade Lake 2.7 GHz	2	1	1.22
x86 Cascade Lake 2.7 GHz	4	1	0.75

So the Pi is about 2.5x slower on both a single core and multiple cores compared to the MacBook Pro and one of the top supercomputers. Still reasonable. I have no idea why the overclocked performance degrades for parallel runs. With 4 MPI processes, the highest temperature reaches about 58 degrees Celsius. With 4 threads, the highest temperature reaches 50 degrees Celsius, and the run is about 30% slower than with 4 MPI processes. What puzzles me is that the parallel performance is in general really bad.
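To put numbers on the parallel scaling, here is a minimal Python sketch that computes speedup and parallel efficiency relative to the serial run. The timings are copied from the table above; the `timings` dictionary and the `scaling` helper are illustrative names of my own, not part of BATSRUS.

```python
# Shocktube timings in seconds, keyed by (MPI processes, threads),
# copied from the table above.
timings = {
    "Pi ARM 1.5 GHz": {(1, 1): 2.97, (1, 2): 1.85, (1, 4): 1.55,
                       (2, 1): 2.12, (4, 1): 2.27},
    "Pi ARM 2.05 GHz": {(1, 1): 2.20, (1, 2): 2.40, (1, 4): 2.82,
                        (2, 1): 1.75, (4, 1): 2.06},
    "Mac i7 2.2 GHz": {(1, 1): 1.18, (1, 2): 0.85, (1, 4): 0.67,
                       (2, 1): 0.82, (4, 1): 0.68},
    "x86 Cascade Lake 2.7 GHz": {(1, 1): 1.19, (1, 2): 0.88, (1, 4): 0.72,
                                 (2, 1): 1.22, (4, 1): 0.75},
}


def scaling(runs):
    """Return {(mpi, threads): (speedup, efficiency)} relative to the serial run."""
    serial = runs[(1, 1)]
    result = {}
    for (mpi, threads), t in runs.items():
        cores = mpi * threads          # total cores used by this run
        speedup = serial / t           # > 1 means faster than serial
        result[(mpi, threads)] = (speedup, speedup / cores)
    return result


for platform, runs in timings.items():
    for (mpi, threads), (speedup, eff) in scaling(runs).items():
        print(f"{platform:26s} MPI={mpi} threads={threads} "
              f"speedup={speedup:.2f} efficiency={eff:.0%}")
```

For example, the Pi at 1.5 GHz gets a speedup of about 1.9x on 4 threads (roughly 48% efficiency), while the Mac gets about 1.8x (44%). An efficiency below 1/cores, as in the overclocked threaded runs, means the parallel run is slower than the serial one.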

However, I was told the other day that a Monte Carlo code using random number generators gave results on ARM that differ by about 1% from those on AMD and Intel CPUs in a very accurate molecular dynamics simulation. This is a big warning flag if you want to dive deep into ARM.

I still don’t understand why a presumably multi-core program, such as an MPI code written in C/C++ or Fortran, triggers the fan while the Julia ones don’t.