# MonthOfJulia Day 12: Parallel Processing
Unlike many other languages, where parallel computing is bolted on as an afterthought, Julia was designed from the start with parallel computing in mind. It has a number of native features which lend themselves to efficient implementation of parallel algorithms. It also has packages which facilitate cluster computing (using MPI, for example). We won't be looking at those here, focusing instead on coroutines, generic parallel processing and parallel loops.
Coroutines are not strictly parallel processing (in the sense of “many tasks running at the same time”) but they provide a lightweight mechanism for having multiple tasks defined (if not active) at once. According to Donald Knuth, coroutines are generalised subroutines (with which we are probably all familiar).
> Under these conditions each module may be made into a _coroutine_; that is, it may be coded as an autonomous program which communicates with adjacent modules as if they were input or output subroutines. Thus, coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program. There is no bound placed by this definition on the number of inputs and outputs a coroutine may have.
>
> — Conway, "Design of a Separable Transition-Diagram Compiler", 1963
Coroutines are implemented using `produce()` and `consume()`. In a moment you'll see why those names are appropriate. To illustrate we'll define a function which generates elements from the Lucas sequence. For reference, the first few terms in the sequence are 2, 1, 3, 4, 7, … If you know about Python's generators then you'll find the code below rather familiar.
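As a sketch: the `produce()`/`consume()` pair was replaced by `Channel` in Julia 1.0, so on current versions an equivalent Lucas generator looks like this (the function and variable names are illustrative):

```julia
# Lucas numbers: L(1) = 2, L(2) = 1, L(n) = L(n-1) + L(n-2).
# put!() plays the role that produce() played in older Julia versions:
# it hands the next term to whichever task asks for it.
function lucas(c::Channel)
    a, b = 2, 1
    while true
        put!(c, a)        # yield the next term to the consumer
        a, b = b, a + b
    end
end

# Wrap the generator in a Channel; this spawns a Task behind the scenes.
lucas_task = Channel(lucas)
```

Calling `take!(lucas_task)` repeatedly then yields 2, 1, 3, 4, 7, … in turn.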
This function is then wrapped in a `Task`, which initially has state `:runnable`.
Now we're ready to start consuming data from the `Task`. Data elements can be retrieved individually or via a loop (in which case the `Task` acts like an iterable object and no explicit `consume()` is required).
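A sketch of both retrieval styles, using the `Channel`-based idiom of Julia 1.0+ (the handful of hard-coded Lucas terms is purely illustrative):

```julia
# A small generator yielding a few Lucas terms for illustration.
ch = Channel() do c
    for x in (2, 1, 3, 4, 7)
        put!(c, x)
    end
end

# Retrieve a single element (the modern analogue of consume()).
first_term = take!(ch)    # 2

# Or iterate: the channel behaves like an iterable object and the loop
# ends when the generator function returns and the channel closes.
remaining = Int[]
for x in ch
    push!(remaining, x)
end
```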
Between invocations the `Task` is effectively asleep. It temporarily springs to life every time data is requested, before becoming dormant once more.
It’s possible to simultaneously set up an arbitrary number of coroutine tasks.
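For instance, a sketch with two coroutines defined at once (hypothetical even and odd generators), each dormant until polled:

```julia
# Two independent coroutines; neither runs until an element is requested.
evens = Channel() do c
    for n in 0:2:10
        put!(c, n)
    end
end
odds = Channel() do c
    for n in 1:2:11
        put!(c, n)
    end
end

# Each task advances independently of the other.
first_pair = (take!(evens), take!(odds))   # (0, 1)
```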
Coroutines don’t really feel like “parallel” processing because they are not working simultaneously. However it’s rather straightforward to get Julia to metaphorically juggle many balls at once. The first thing that you’ll need to do is launch the interpreter with multiple worker processes.
There's always one more process than the number specified on the command line: the command line option gives the number of worker processes, and the master process is added on top of those.
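As a sketch, the same setup can be reproduced from within a running session: `addprocs(n)` is the runtime equivalent of launching with `julia -p n` (the count of four is illustrative).

```julia
using Distributed

addprocs(4)    # equivalent to launching with `julia -p 4`

nprocs()       # 5: the four workers plus the master process
nworkers()     # 4
```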
We can launch a job on one of the workers using `@spawn`. Both `@spawn` and `@spawnat` are macros which launch jobs on individual workers: `@spawn` lets Julia choose the worker, while `@spawnat` targets a specific one. The `@everywhere` macro executes code across all processes (including the master).
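A sketch of these macros in action (the worker count and expressions are illustrative; note that in Julia 1.3+ the old `@spawn` is spelled `@spawnat :any`):

```julia
using Distributed

addprocs(2)    # two workers, for illustration

# @spawnat runs an expression on a specific worker; fetch() blocks
# until the result is ready and retrieves it.
r = @spawnat 2 myid()
fetch(r)       # 2: the expression ran on worker 2

# Let Julia pick the worker (the modern spelling of @spawn).
s = @spawnat :any myid()

# @everywhere executes on all processes, master included.
@everywhere println("process $(myid()) ready")
```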
## Parallel Loops and Map
To illustrate how easy it is to set up parallel loops, let’s first consider a simple serial implementation of a Monte Carlo technique to estimate π.
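A minimal serial sketch (the function name and sample count are illustrative): sample points uniformly in the unit square and count the fraction landing inside the quarter circle, which approaches π/4.

```julia
# Serial Monte Carlo estimate of π.
function estimate_pi(n)
    inside = 0
    for i in 1:n
        x, y = rand(), rand()
        inside += (x^2 + y^2 <= 1.0)   # true/false promotes to 1/0
    end
    4 * inside / n
end

pi_estimate = estimate_pi(1_000_000)   # close to π; varies per run
```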
The quality of the result as well as the execution time (and memory consumption!) depend directly on the number of samples.
The parallel version is implemented using the `@parallel` macro, which takes a reduction operator (in this case `+`) as its first argument.
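A sketch of the parallel version (note: `@parallel` was renamed `@distributed` in Julia 1.0, keeping the same reduction-operator syntax; the function name and worker count are illustrative):

```julia
using Distributed

addprocs(4)    # an assumed worker count

# The (+) reduction sums each iteration's 0-or-1 result across workers.
function estimate_pi_parallel(n)
    inside = @distributed (+) for i in 1:n
        Int(rand()^2 + rand()^2 <= 1.0)
    end
    4 * inside / n
end
```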
There is significant overhead associated with setting up the parallel jobs, so the parallel version actually performs worse for a small number of samples. But with sufficiently many samples the speedup becomes readily apparent.
For reference, these results were achieved with 4 worker processes on a DELL laptop with the following CPU: