## #MonthOfJulia Day 12: Parallel Processing

Unlike many other languages, where parallel computing is bolted on as an afterthought, Julia was designed from the outset with parallel computing in mind. It has a number of native features which lend themselves to efficient implementation of parallel algorithms. It also has packages which facilitate cluster computing (using MPI, for example). We won't be looking at those today, focusing instead on coroutines, generic parallel processing and parallel loops.

## Coroutines

Coroutines are not strictly parallel processing (in the sense of "many tasks running at the same time"), but they provide a lightweight mechanism for having multiple tasks defined (if not active) at once. Donald Knuth described coroutines as generalisations of subroutines (with which we are probably all familiar), citing Conway's original definition:

> Under these conditions each module may be made into a _coroutine_; that is, it may be coded as an autonomous program which communicates with adjacent modules as if they were input or output subroutines. Thus, coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program. There is no bound placed by this definition on the number of inputs and outputs a coroutine may have.
>
> — Conway, "Design of a Separable Transition-Diagram Compiler", 1963.

Coroutines are implemented using `produce()` and `consume()`. In a moment you'll see why those names are appropriate. To illustrate we'll define a function which generates elements from the Lucas sequence. For reference, the first few terms in the sequence are 2, 1, 3, 4, 7, … If you know about Python's generators then you'll find the code below rather familiar.
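A minimal sketch of such a generator, using the `produce()` interface described here (the pre-1.0 task API, since replaced by Channels in later Julia versions):

```julia
# Generator for the Lucas sequence: 2, 1, 3, 4, 7, 11, ...
function lucas()
    a, b = 2, 1
    while true
        produce(a)        # yield the current term and suspend the task
        a, b = b, a + b   # each term is the sum of the two before it
    end
end
```

The infinite `while` loop is not a problem: the function only runs as far as the next `produce()` each time a value is requested.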

This function is then wrapped in a `Task`, which initially has state `:runnable`.

Now we're ready to start consuming data from the `Task`. Data elements can be retrieved individually or via a loop (in which case the `Task` acts like an iterable object and no explicit `consume()` is required).
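Putting those two steps together, a sketch in the pre-1.0 `produce()`/`consume()` API might look like this (the generator is repeated so the snippet stands alone):

```julia
# Lucas-sequence generator (pre-1.0 produce() API).
function lucas()
    a, b = 2, 1
    while true
        produce(a)
        a, b = b, a + b
    end
end

lucas_task = Task(lucas)   # state is :runnable; nothing has run yet

consume(lucas_task)        # -> 2
consume(lucas_task)        # -> 1
consume(lucas_task)        # -> 3

# Or treat a Task as an iterable and loop over it directly:
for n in Task(lucas)
    println(n)
    n > 20 && break        # the sequence is infinite, so break out
end
```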

Between invocations the `Task` is effectively asleep: it springs to life each time data is requested, then becomes dormant once more.

It’s possible to simultaneously set up an arbitrary number of coroutine tasks.
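For instance, two tasks built from the same generator keep entirely separate state (`counter()` is a hypothetical generator in the same pre-1.0 style):

```julia
# A trivial generator: yields 1, 2, 3, ...
function counter()
    n = 0
    while true
        produce(n += 1)
    end
end

t1 = Task(counter)
t2 = Task(counter)

consume(t1)    # -> 1
consume(t1)    # -> 2
consume(t2)    # -> 1 (t2 starts from the beginning, unaffected by t1)
```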

## Parallel Processing

Coroutines don't really feel like "parallel" processing because the tasks are not actually running simultaneously. However, it's rather straightforward to get Julia to metaphorically juggle many balls at once. The first thing you'll need to do is launch the interpreter with multiple worker processes.

There's always one more process than the number specified on the command line (we specified the number of worker processes; add one for the master process).
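A session along these lines illustrates the count (launching with four workers):

```
$ julia -p 4

julia> nprocs()     # total processes: 4 workers + 1 master
5

julia> nworkers()
4
```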

We can launch a job on one of the workers using `remotecall()`.
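A sketch using the argument order from this era of Julia (worker id first; later versions take the function first, as `remotecall(rand, 2, 3, 3)`):

```julia
r = remotecall(2, rand, 3, 3)   # start rand(3, 3) on worker 2; returns a remote reference
fetch(r)                        # block until the 3x3 random matrix is available
```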

`@spawn` and `@spawnat` are macros which launch jobs on individual workers. The `@everywhere` macro executes code across all processes (including the master).
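A quick sketch of all three (the helper `f()` is hypothetical, just to show a definition propagating to every process):

```julia
r1 = @spawn rand(5)       # the scheduler picks an available worker
fetch(r1)                 # a 5-element random vector

r2 = @spawnat 3 myid()    # run on worker 3 specifically
fetch(r2)                 # -> 3

@everywhere f(x) = 2x     # f() now exists on every process, master included
```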

## Parallel Loop and Map

To illustrate how easy it is to set up parallel loops, let’s first consider a simple serial implementation of a Monte Carlo technique to estimate π.
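The serial version might look something like this: sample points uniformly in the unit square and count the fraction landing inside the quarter unit circle, which has area π/4.

```julia
function estimate_pi(n)
    hits = 0
    for i in 1:n
        x, y = rand(), rand()
        hits += x*x + y*y <= 1   # true (counts as 1) if inside the quarter circle
    end
    4 * hits / n                 # hits/n estimates pi/4
end

estimate_pi(10^7)                # -> approximately 3.1416
```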

The quality of the result as well as the execution time (and memory consumption!) depend directly on the number of samples.

The parallel version is implemented using the `@parallel` macro, which takes a reduction operator (in this case `+`) as its first argument.
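A sketch of the parallel version (`@parallel` was later renamed `@distributed`; the `(+)` reduction sums the per-iteration `Bool` results across the workers):

```julia
function parallel_pi(n)
    hits = @parallel (+) for i in 1:n
        rand()^2 + rand()^2 <= 1   # Bool per iteration, summed by (+)
    end
    4 * hits / n
end

parallel_pi(10^8)
```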

There is significant overhead associated with setting up the parallel jobs, so the parallel version actually performs worse for a small number of samples. Run enough samples, though, and the speedup becomes readily apparent.
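The section title also promises map: `pmap()` is the parallel analogue of `map()`, best suited to a modest number of relatively expensive calls. A sketch, where `slow_sq()` is a hypothetical stand-in for real work:

```julia
@everywhere slow_sq(x) = (sleep(1); x^2)   # define the helper on all workers

pmap(slow_sq, 1:8)   # roughly 2 s of wall time with 4 workers, versus ~8 s serially
```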

For reference, these results were achieved with 4 worker processes on a DELL laptop.

More information on Julia's parallel computing facilities can be found in the documentation. As usual, the code for today's Julia journey can be found on GitHub.