Explicit speculative parallelism for Haskell's Par monad
Haskell provides quite a few ways for writing parallel programs, but none of them is fully automatic. The programmer has to use some annotations or library to run computations in parallel explicitly. The most recent paper (and library) for writing parallel programs follows the latter approach. You can find more information about the library in a paper by Simon Marlow et al. A monad for deterministic parallelism  and it is also available on Hackage. However, I'll explain all the important bits that I'll use in this article.
The library provides an explicit way for writing parallel programs using a
Par monad. The library contains
constructs for spawning computations and sharing state using blocking variables. However, the whole programming model
is fully deterministic. I believe that it is sometimes useful to lift the determinacy requirement.
Some programs are deterministic, but the fact cannot be (easily) automatically verified. For example, say you have two functions
fib2. They both give the same result, but each of them is more efficient
than the other one on certain inputs. To calculate a Fibonacci number, the program could speculatively
try to evaluate both functions in parallel and return the result of the one that finishes first.
Unfortunately, this cannot be safely implemented using a fully deterministic library. In this article, I'll show some
examples where speculative parallelism can be useful and I'll talk about an extension to the
that I implemented (available on GitHub). The extension allows
programmers to write speculative computations, provided that they manually verify that their code is
In this section, I'll show a basic example that demonstrates how the
Par monad works. I'll use a
contrived example of calculating Fibonacci numbers to keep things simple. You can find better examples in
Par monad examples repository.
The performance measurements that I'll include in this article are just for demonstration (executed
on Intel Core 2 Duo CPU). I'll also briefly introduce the
unamb operator  designed by
Conal Elliott. This solves a very
important problem, originally in Functional Reactive Programming , which is closely related to
Deterministic parallelism monad
The parallelism wraps a computation inside a monad. A parallel version of a function that returns
Int64 will have a return type
Par Int64. The type represents a computation
that can be started (using the
Here is a basic example:
-- Naive calculation of Fibonacci numbers fib :: Int64 -> Int64 fib x | x < 1 = 1 fib x = fib (x-2) + fib (x-1) -- Parallel version of the function pfib :: Int64 -> Par Int64 pfib n | n < 25 = return (fib n) pfib n = do av <- spawn (pfib (n - 1)) b <- pfib (n - 2) a <- get av return (a + b)
fib function is a usual (non-parallel) function for calculating Fibonacci numbers. The
pfib function is a parallel version (as you can see from the type signature). When the argument
is smaller than 25, it runs sequential
fib to avoid generating too many small work items.
If the argument is sufficiently large, the function spawns a computation to calculate
pfib (n-1) in parallel. The type of the
spawn function is:
spawn :: NFData a => Par a -> Par (IVar a)
NFData requirement specifies that it should be possible to fully evaluate the
data structure. The rest of the signature says that the function takes some computations and returns
a variable (wrapped in
Par) that will eventually contain the result of the
computation. When writing code in the
Par monad, we can get a value from the variable
get function. The function blocks the computation until a value becomes
available and has the following type:
get :: IVar a -> Par a
Now it should be easy to understand the rest of the
pfib body. It spawns one recursive
computation (as a background task), then runs the second recursive call on the main thread and then
waits until the background operation completes. Finally, it returns the sum of the results.
The following simple snippet shows how to run the two functions and it shows some approximate performance measurements:
main = do -- Naive sequential fibonacci (1.8 sec) print $ fib 32 -- Parallel version with 25 threshold (1.0 sec) print $ runPar $ pfib 32
When we have a value of type
Par a, we can evaluate it (to get an
runPar function. When running the parallel function on dual-core machine,
it gives you a slightly less than 2x speedup. For comparison, a version implemented using the
pseq combinators (which is a lower-level parallelism mechanism
in GHC) completes in approximately 0.95 seconds.
Finally, it is worth mentioning that the programming model used by the
Par monad is
very similar to the model used by
async in F#. The
Async.StartChild and the
runPar does the same thing as
Before discussing the support for speculative programming, I'd like to shortly discuss the unambiguous choice operator, which was created by Conal Elliott for his Functional Reactive Programming (FRP) implementation, but is also available as a separate package on Hackage.
The operator is a function of type
a -> a -> a. It takes two arguments, starts
evaluating them concurrently and returns a value that becomes available first.
When using the operator, the programmer needs to make sure that the two arguments
are compatible. This means that if they both finish evaluating, their
values are equal. Formally, the requirement looks like this:
compatible a b = (a = ⊥) | (b = ⊥) | (a = b)
The operator can be used for two purposes. First, it can be used to deal with computations
that do not terminate. For example, say you have an expression
a && b. When
False, the result of the expression should be (logically) also
a doesn't terminate, the overall expression will also never
yield a value, because it will start evaluating
a first. Using
you can write:
import Data.Unamb c = unamb (a && b) (b && a)
This use clearly matches the compatibility requirement. When both arguments terminate, the
result is same in both of the cases. However, the expression returns
a is non-terminating computation, because the second argument will
The second possible use of
unamb is for simple speculative evaluation.
I described a typical scenario earlier. We have two functions and one is faster for some of
the inputs, but not for all of them. Using
unamb, we can write the following
helper that takes two functions and returns the value that is produced first:
tryBoth :: (t -> a) -> (t -> a) -> t -> a tryBoth f1 f2 n = unamb (f1 n) (f2 n)
To demonstrate how the function works, I implemented a more clever function for calculating
Fibonacci numbers (
ffib), which counts from zero and keeps adding last two
generated numbers. If you want to play with it, you can find the complete source code at
the end of the article. The following example shows the results:
main = do -- Naive recursive implementation (~1.7 sec) print fib 32 -- Efficient implementation (~0.01 sec) print ffib 32 -- Tries evaluating result using both functions (~0.01 sec) tryBoth fib ffib 32
In this example, we could just call
ffib, because it will be always faster.
However, that's not always the case. You might have a very fast function that almost always
works, but sometimes fails and a backup function that is slow, but always works.
unamb operator doesn't break determinacy of code only when it is
used properly. The programmer needs to make sure that the arguments are compatible.
However, I believe that this kind of construct is very useful and I hope that the previous
examples demonstrated that. In the rest of the article, I'll describe my extension that
allows you to use similar concepts in the
Speculative parallelism for Par monad
The extension (available on GitHub) adds support for explicit cancellation of computations. When starting a computation, you can give it a cancellation token. The token can be later used (from another computation) to cancel the background task. The design of the extension is very similar to how cancellation works in F# asynchronous workflows and in Task Parallel Library (TPL) in .NET.
Parallel unambiguous choice
The support for cancellation is a low-level control structure and it is not expected that
it will be used directly very often. However, you can use it to implement your own functions
that provide higher-level of abstraction. For example, it is quite easy to implement
unamb operator for
Par a values:
-- Unambiguous choice between two parallel computations -- Assumes that computations p1 and p2 are compatible punamb :: NFData a => Par a -> Par a -> Par a punamb p1 p2 = do res <- newBlocking ct1 <- newCancelToken ct2 <- newCancelToken forkWith ct1 (do a <- p1; put res a; cancel ct2) forkWith ct2 (do a <- p2; put res a; cancel ct1) v <- get res return v
The type of the function specifies that the value returned by the function
can be fully evaluated, but it is equally easy to define a version that evaluates the
value to a head-strict form (using
put_ instead of
The body of the
punamb first creates some object for implementing the synchronization.
newBlocking to create a variable (of type
IVar a) for storing the result.
A variable created using
newBlocking behaves slightly differently than standard variables created
new. When you attempt to write a value to a standard variable for the second time, the
writing fails and causes an exception. However, variables created using
differently - instead of causing an exception, the second write is just blocked (indefinitely) and the
operation never completes. This makes it possible to implement a race between two computations.
The next two values
ct2 are cancellation tokens that are later used
to cancel running computations.
After the initialization, the body starts two background computations using the
The function takes a
CancelToken as the first argument and a computation of type
as the second argument. When called (from the
Par) monad, it starts the computation in
background and associates the cancellation token with it. This allows others to cancel the computation.
The two computations started using
forkWith are quite simple. Each of them evaluates
one of the
punamb's arguments, then attempts to cancel the other computation and
stores the result to the
res variable. In a typical case, one of the computations
completes first and canceles the other (which is still running). However, if both complete at the
same time, then one wins the race and writes value to
res. The other attempts to write
the result, but gets blocked. The race doesn't affect determinism, as long as the two computations
given as arguments to
punamb are compatible.
The following snippet uses
punamb to implement an example similar to the
earlier demonstration of
pffib function (not shown in this
article) is a faster implementation of Fibonacci numbers embedded in the
main = do -- Naive recursion using Par monad (~1.1 sec) print $ runPar $ pfib 32 -- Efficient recursion using Par monad (~0.01 sec) print $ runPar $ pffib 32 -- Tries evaluating both & returns the first (~0.04 sec) print $ runPar $ punamb (pfib 32) (pffib 32)
As you would expect, the
punamb operation behaves similarly to
When given two computations, it finishes after the first one completes. The time is slightly larger,
because the first computation (running in parallel) also occupies some of the resources.
You may be wondering how the cancellation works - at which point is a
computation cancelled when another computation calls
cancel. The following
section briefly explains how is the cancellation implemented in my modification of the package.
How are computations cancelled?
The cancellation for the
Par monad is cooperative, which means that the
computations can be cancelled only when they are written in a particular way that permits
cancellation. In particular, the
cancel function will not be able to cancel a
Par computation that is created by writing
return foo (where
foo is some expression).
A computation can be cancelled at points when it performs some special computation in
Par monad. Most importantly, this includes the monadic binding (as well as
other special calls like forking). The
pfib function (demonstrated above)
contains many such points. It can be cancelled when it tries to spawn a recursive call (in
background), when it recursively calls itsef as well as when it tries to read the result
of the background computation.
You can exlpore the source code of my modification on GitHub. The commit that adds the cancellation support
shows the difference compared with the original version. Implementing an un-cooperative
cancellation (as, for example, in
unamb) would be also possible, but more
difficult. However, I believe that the cooperative cancellation model works very well
Par monad, because parallel computations generally consist of
large number of small (atomic) steps.
To finish the blog post, the last section shows a single larger example of implementing tree processing as a speculative computation explicitly using cancellation.
Speculative tree processing
In this section, I present a larger example that uses the cancellation support to implement
a speculative processing of a tree. We look how to implement a
forall function that checks whether
a specified predicate holds for all leaves of the tree. When the function returns
False for some
value, the overall result will also be false. In this case, the function can cancel all parallel tasks and return
The example uses the following simple tree structure:
data Tree a = Leaf a | Node (Tree a) (Tree a)
The following snippet shows a basic parallel version of
forall. It uses the same recursive
parallel processing pattern that was used earlier in the
-- Test whether all elements match the given predicate (in parallel) forallp :: (a -> Bool) -> Tree a -> Par Bool forallp p (Leaf num) = return $ p num forallp p (Node left right) = do al' <- spawn $ forallp p left ar <- forallp p right al <- get al' return (al && ar)
When processing a leaf, the function calls the predicate and returns the result. When processing a node,
it spawns processing of the left sub-tree as a background operation (using
spawn, which is
fork) and then processes the right sub-tree. After both operations
complete, it returns logical conjunction of the two results.
When implementing the speculative version, we need to spawn two background computations (to start
processing the left and right sub-tree). When any of them completes returning
can immediately store the result in some variable, which unblocks the main computation (and it can
return the overall result). When the result is
False, the process needs to wait until the
other computation completes before it can set the final result.
The following snippet shows the speculative version:
-- Test whether all elements match the given predicate -- (Speculatively - False is returned immediately) foralls :: (a -> Bool) -> Tree a -> Par Bool foralls p tree = do tok <- newCancelToken r <- forall' p tok tree cancel tok return r where -- Recursively start computations using the same cancellation token forall' p tok (Leaf v) = return (p v) forall' p tok (Node left right) = do leftRes <- new rightRes <- new finalRes <- newBlocking forkWith tok (forall' p tok left >>= completed leftRes rightRes finalRes) forkWith tok (forall' p tok right >>= completed rightRes leftRes finalRes) get finalRes -- Called when one of the branches complete. Write result immediately -- if it is False, otherwise wait for the second computation. completed varA varB fin resA = do put varA resA if not resA then put fin False else get varB >>= put fin . (&& resA)
The main body of the function creates a new
CancelToken and then calls a helper that does
the actual processing. The cancellation token is used when starting any background computations during the
recursive processing. As a result, when the helper function returns
False (meaning that some
computations may be still running), we can cancel all (still running, but unnecessary) computations
using this single token.
The function that implements the recursive processing first creates three variables. The first two
rightRes) are used by their corresponding computations (when a
left computation completes, it writes result to
leftRes). The last variable is created using
newBlocking function, because it can be accessed by any of the two computations. Its value
is set to the final result as soon as possible. The function then spawns two tasks to process sub-trees
and waits for the final result.
The two background computations spawned by the function both make a recursive call and then pass the
result to the
completed function. It takes all three variables (variable of the currently
completed computation, variable of the other computation and a final result variable) and the result
of the computation. If the result is
False, then it sets the final result. Otherwise, it
waits until the other computation completes (by blocking on the other variable) and then sets the
final result (in this case, the computation may race with the other call to
but that is not a problem - they both produce the same value and the second one will just block
when it attempts to write).
The complete source code contains a full example that you can run to test the function. Assuming we have
a tree value named
tree, we can write the following test. It demonstrates the performance of a
simple sequential version, parallel version and a speculative version from the last listing:
main = do -- Binary tree contains ~270 large primes and a single -- non-prime number. The tree is fully evaluated. -- Single-threaded version (see source code link) (~3.7 sec) measure "Non-prime num (Seq) " (forall isPrime tree2) -- Simple parallel version (~1.9 sec) measure "Non-prime num (Par) " (runPar $ forallp isPrime tree2) -- Parallel version with shortcircuiting (~0.01 sec) measure "Non-prime num (Shr) " (runPar $ foralls isPrime tree2)
The example generates a tree containing large prime numbers and one number that is not a prime. Then it ensures that the tree is fully evaluated and processes it using different techniques. Parallel processing is roughly two times faster than sequential version, but a speculative version is faster by orders of magnitude. This is, of course, the case only when the tree contains at least one non-prime number. The performance depends on the data, but in general, the speculative parallelism can be used to implement various useful heuristics.
In this article, I demonstrated a couple of extensions that I implemented for the
monad that has been recently proposed by Simon Marlow, Ryan Newton and Simon Peyton Jones. The
extension adds functions for cancellation of computations in the monad.
The support for cancellation should be viewed as a low-level mechanism. In practice, it can be
used to implement useful higher-level abstractions such as the
unamb operator designed
by Conal Elliott. The operator provides a high-level way for writing speculative computations
(or working with partially undefined functions). When using the operator, the developer is
responsible for ensuring that the arguments are compatible (will give the same result).
In this article, I showed an implementation of
unamb for the
and also a larger tree-processing example. The tree processing example cannot be elegantly written
unamb, because it requires more expressive abstraction. This abstraction could be,
for example, a Haskell version of joinads , a language extension that I designed with Don
Syme during an internship at Microsoft Research. I'll write more about that in some future blog post...
Downloads & Source code
- Implementation of the
Parmonad with cancellation (GitHub)
- Complete source code of examples from the article (GitHub)
-  A monad for deterministic parallelism (PDF) - Simon Marlow, Ryan Newton, Simon Peyton Jones
-  Push-pull functional reactive programming - Conal Elliott
-  Unamb - Haskell Wiki
-  Joinads: a retargetable control-flow construct for reactive, parallel and concurrent programming - Tomas Petricek, Don Syme
- RE: Explicit speculative parallelism for Haskell's Par monad by Svick (5/17/2011 8:19:25 PM)
Do I understand it correctly that the following scenario can't happen in punamb, because there are no possible points where a computation can be canceled?
1. The first computation completes
2. The second computation completes
3. The first computation cancels the second computation
4. The second computation cancels the first computation
5. The first computation is canceled without setting the result variable
6. The second computation is canceled without setting the result variable
- RE: Explicit speculative parallelism for Haskell's Par monad by Don Stewart (5/18/2011 7:31:51 AM)
> but none of them is fully automatic
Actually, quick nit-pick. array programming in Haskell with "repa arrays" is automatically parallel (it uses a strict multi-dimensional data parallel model, with a well-defined cost model, making it feasible to automatically parallelize the Haskell code.
It's more correct to say that there are a wide range of parallel programming models in Haskell, ranging from full explicit, to annotations and semi-implicit, to automatic.
- RE: Explicit speculative parallelism for Haskell's Par monad by Tomas (5/26/2011 1:36:43 AM)
@Svick: Thanks for pointing this out. This is a problem with the sample - it should first call 'put' and then call 'cancel' (I'll correct the blog post). When changed like this, the only tricky thing is that a computation that calls 'put' as the second doesn't get to call 'cancel' (because it will be blocked), but that's not a problem because the other computation has completed.
In principle, it would be also possible to guarantee that there are no possible interleaving points between the two calls. However, the do-notation inserts a call to monadic 'bind' and this is a cancellable operation (although, it probably doesn't need to be).
- RE: Explicit speculative parallelism for Haskell's Par monad by Tomas (5/26/2011 1:46:26 AM)
@Don: Yes, I understand your point. Declarative parallelism (as with repa or Parallel LINQ) is a fully automatic DSL, but you still need to use the DSL and explicitly mark a part of your program in some way.
I just wanted to highlight that - Although you can, in theory, automatically parallelize any purely functional program, this simple solution just doesn't work. Even with declarative array parallelism, the compiler doesn't decide which parts of ordinary program could use & benefit from this method.