So last week I wrote a post about how to build and configure a Raspberry Pi cluster to make use of MPI – the Message Passing Interface for parallel computing.
Today I’m going to show my initial results from testing its performance characteristics.
If you followed my previous post, you’ll know that I constructed my Pi Cluster from the new Raspberry Pi 2 boards. This board is a big step up from its predecessor in compute terms: it’s powered by a 900MHz quad-core ARM Cortex-A7 CPU, as opposed to the 700MHz single-core processor in the earlier Model B Pi. There are other differences too (double the memory, for example), but this suggests that for simple workloads bounded only by compute power, the new Pi should provide a good engine.
Having built and configured the cluster with mpi4py, as in the earlier post, the only other change I made was to run NFS on one of the Pis, and to mount an exported share across all of the cluster nodes (including the NFS server itself). This means source files can easily be shared across the cluster, with identical file locations and paths on every node, something which appears essential for mpi4py to operate.
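For reference, the NFS setup amounts to a one-line export on the server plus a mount on every node. A minimal sketch follows; the hostname `pinode01` and the subnet are assumptions, so substitute your own:

```shell
# On the NFS server node: add the export (subnet is an assumption)
# /etc/exports
#   /mnt/shared 192.168.0.0/24(rw,sync,no_subtree_check)
sudo exportfs -ra

# On every node (including the server itself), mount at the SAME path,
# so that file locations are identical cluster-wide:
sudo mkdir -p /mnt/shared
sudo mount -t nfs pinode01:/mnt/shared /mnt/shared
```

Adding the mount to /etc/fstab on each node makes it survive reboots.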
I based my testing upon one of the simple demo programs that comes with mpi4py – appropriately enough, a tool for calculating the value of pi in a distributed fashion. You can find the calculation in the demo directory of the mpi4py source.
mpi4py comes with 3 different implementations of this calculation, each sharing workload across the cluster using different MPI techniques.
I took just one of these, cpi-cco.py, which operates using Collective Communication Operations (CCO) within Python objects exposing memory buffers (this requires NumPy). I made two trivial changes: removing the looping, so that each execution of the program calculates pi only once, and allowing the number of iterations to be specified as a command-line parameter. This let me simply use the Linux time command to measure the execution time. You can retrieve a gist of the modified file here and execute the program thus:
~# time mpiexec -n 32 -f /mnt/shared/machinefile python /mnt/shared/cpi-cco.py 1000
pi is approximately 3.1415927369231271, error is 0.000000083333334
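For orientation, the numerical core of the demo is a midpoint-rule approximation of the integral of 4/(1+x^2) over [0,1], which equals pi, with each MPI rank evaluating a strided subset of the intervals. Below is a minimal, MPI-free sketch of that per-rank kernel (the function and variable names are my own); in the full cpi-cco.py, each rank’s partial sum is combined via a collective reduction into a one-element NumPy buffer:

```python
import numpy as np

def comp_pi(n, myrank=0, nprocs=1):
    """Midpoint-rule partial sum for pi, over this rank's strided
    subset of the n intervals. With nprocs=1 this is the full
    (serial) calculation."""
    h = 1.0 / n
    i = np.arange(myrank + 1, n + 1, nprocs)  # this rank's interval indices
    x = h * (i - 0.5)                          # interval midpoints
    return h * np.sum(4.0 / (1.0 + x**2))

# Serial check with n = 1000 intervals, as in the run above:
pi_est = comp_pi(1000)
```

Summing `comp_pi(1000, r, 4)` over ranks r = 0..3 reproduces the serial result, which is exactly what the collective reduction does across the cluster.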
In this example I have shared the /mnt/shared folder across all nodes in the cluster; as in the previous post, the file machinefile lists the cluster members, and the -n parameter specifies the number of processes to use.
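For completeness, an MPICH-style machinefile simply lists one host per line, optionally suffixed with the number of processes to place there. A sketch for a cluster of eight quad-core boards follows; the hostnames are purely illustrative:

```shell
# /mnt/shared/machinefile -- hostnames are assumptions
# format: hostname:processes-per-host (MPICH Hydra style)
pinode01:4
pinode02:4
pinode03:4
pinode04:4
pinode05:4
pinode06:4
pinode07:4
pinode08:4
```

With four slots per board, eight entries accommodate the -n 32 run shown above.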
For the experiment itself, I chose an iteration count of 1 million, and ran the tests twice – once without the machinefile parameter (and thus constrained to the single board), and once with the cluster fully utilised.
The results nicely match the expected behaviour:
The performance improves as long as there are spare cores available – whether on the single board, or across the cluster.
For this simple example, whose load is not throttled by memory or network I/O and whose behaviour and processing cost per process are consistent, it’s no surprise to see an excellent fit with the expected straight line on a log-log plot of runtime against process count, with a breakaway happening only once all available cores are utilised.
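That expected shape is easy to check numerically: under ideal strong scaling the runtime is t1/n, so log(runtime) against log(n) is a straight line of slope -1 until n exceeds the available cores, after which extra processes just share the same cores. A small sketch, using a hypothetical single-process runtime:

```python
import math

def ideal_time(t1, n, cores):
    # Ideal strong scaling: runtime halves as the process count
    # doubles, until every physical core is busy; beyond that,
    # additional processes only time-share the same cores.
    return t1 / min(n, cores)

t1 = 100.0    # hypothetical single-process runtime (seconds)
cores = 32    # e.g. eight quad-core Pi 2 boards
times = {n: ideal_time(t1, n, cores) for n in (1, 2, 4, 8, 16, 32, 64)}

# Slope of log(time) vs log(n) within the scaling region:
slope = (math.log(times[32]) - math.log(times[1])) / math.log(32)
```

The slope comes out at exactly -1 up to 32 processes, and the curve goes flat beyond it, which is the breakaway seen in the measured results.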