There is probably a ongoing trend of adopting arm chips to commercial and even high performance computing world. I got a chance to try out the new Raspberry Pi 4 with 64-bit quad core processor running at 1.5 GHz and 8 GB of memory. The base frequency is 0.6 GHz, and when running tasks, it will reach 1.5 GHz. The baseline temperature at 0.6 GHz is 41 degrees Celcius. Earlier this year they launched a true 64-bit Linux distribution for arm. Now it’s time to get a first-hand feeling about it.
Julia has an official distribution for the 64-bit arm system. Running my unit test for the Batsrus.jl package takes 33.2 seconds, while on my MacBookPro 2016 it’s around 9.5 seconds. So it’s about 3.5x slower on a single core. It is possible to overclock to about 2.05 GHz. At this clock rate, the test finishes in 27.4 seconds, with highest temperature reaching 47.7 degrees Celcius.
I tried to test my plotting package in Julia, but had no luck in a successful installation.
As for the multi-core performance, I tried my version of BATSRUS. When I compiled for the first time, it complained to me during linking phase:
FaceGradient.f90:(.text+0x3bb4): relocation truncated to fit: R_AARCH64_TLSLE_ADD_TPREL_HI12 against symbol `__facegradient_MOD_dcoorddxyz_ddfd' defined in .tbss section in Tmp_/FaceGradient.o
Searching on the web leads me to an error related to 32/64 bit system.
According to the suggestion, adding
-fPIC flag solved the problem.
Then for the shocktube test, compiled with
gfortran -O3, the results are shown in the following table:
|Platform||# of MPI||# of threads||Timings [s]|
|Pi ARM 1.5 GHz||1||1||2.97|
|Pi ARM 1.5 GHz||1||2||1.85|
|Pi ARM 1.5 GHz||1||4||1.55|
|Pi ARM 1.5 GHz||2||1||2.12|
|Pi ARM 1.5 GHz||4||1||2.27|
|Pi ARM 2.05 GHz||1||1||2.20|
|Pi ARM 2.05 GHz||1||2||2.40|
|Pi ARM 2.05 GHz||1||4||2.82|
|Pi ARM 2.05 GHz||2||1||1.75|
|Pi ARM 2.05 GHz||4||1||2.06|
|Mac i7 2.2 GHz||1||1||1.18|
|Mac i7 2.2 GHz||1||2||0.85|
|Mac i7 2.2 GHz||1||4||0.67|
|Mac i7 2.2 GHz||2||1||0.82|
|Mac i7 2.2 GHz||4||1||0.68|
|x86 Cascade Lake 2.7 GHz||1||1||1.19|
|x86 Cascade Lake 2.7 GHz||1||2||0.88|
|x86 Cascade Lake 2.7 GHz||1||4||0.72|
|x86 Cascade Lake 2.7 GHz||2||1||1.22|
|x86 Cascade Lake 2.7 GHz||4||1||0.75|
So it’s about 2.5x slower on both single core and multi-cores compared to MacBookPro and one of the top supercomputers. Still reasonable. No idea why the overclock performance degrades for parallel runs. With 4 MPI processes, the highest temperature reaches about 58 degrees Celcius. With 4 threads, the highest temperature reaches 50 degrees, and it’s about 30% slower than 4 MPI. What’s puzzling me is that the parallel performance is in general really bad.
However, I was told the other day that a Monte Carlo code that is using random number generators gives about 1% difference in the result of a very accurate molecular dynamics simulation compared to AMD and Intel CPUs. This is a big warning flag if you want to dive deep into ARM.
Until now, I still don’t understand why a presumably multi-core program like MPI written in C/C++ or Fortran will trigger the fan while Julia ones don’t.