# Accelerator and F# (III.): Data-parallel programs using F# quotations

If you've been following this article series, you already know that Accelerator is an MSR library [1, 2] that allows you to run code in parallel either on a multi-core CPU or using shaders on a GPU (see introduction). We also discussed a direct way to use Accelerator from F# (by calling Accelerator methods directly) and implemented Conway's Game of Life. In this article, we'll look at a more sophisticated way of using Accelerator from F#. We'll introduce F# quotations and look at translating 'normal' F# code to use Accelerator.

In general, F# quotations allow us to treat F# code as a data structure and manipulate it. This is very similar to C# expression trees, but the F# implementation is more powerful. We can also mark a standard method or a function with a special attribute that tells the compiler to store a quotation of the body. Then we can access the quotation and traverse it or modify it. In this article we'll use a function that takes an F# quotation (containing a limited set of functions) and executes it using MSR Accelerator. Implementing this functionality is a bit complicated, so we won't discuss the implementation now. We'll leave this for a future article in this series. In the future, we'll also look at other interesting possibilities that we have when writing code using quotations. Here is a list of articles in this series and of the articles that I'm planning to add:

- Accelerator and F# (I.): Introduction and calculating PI
- Accelerator and F# (II.): The Game of Life on GPU
- **Accelerator and F# (III.): Data-parallel programs using F# quotations**
- Accelerator and F# (IV.): Composing computations with quotations

## Processing code with quotations

Let's start by looking at F# quotations briefly. When you use expression trees in C#, the compiler decides whether a lambda expression should be compiled as a delegate or as an expression tree depending on the target type. F# uses a different approach: when we want to compile code as quotations, we mark it explicitly. The following example demonstrates a quoted lambda function that implements blurring of F# `Matrix` values (only in the X coordinate to make the code simpler):

```
> <@ fun input ->
     let sum = (shift input -1 0) .+
               input .+ (shift input +1 0)
     sum /| 3.0f
  @>;;
val it : Expr<Matrix<float32> -> Matrix<float32>>
```

You'll understand the implementation of the function in full detail after reading the article. Briefly, the `shift` function moves values in the matrix by a specified offset (corresponding to `Shift` in Accelerator). The `.+` operator performs point-wise addition of two matrices and finally, the `/|` operator is a point-wise division by a scalar value.

However, we're looking at the example to understand F# quotations. As you can see, the entire function is wrapped inside the `<@ ... @>` operator. This tells the compiler to compile the body as a quotation, which is a data structure representing the code. This is also reflected in the type of the result. The type is inferred as `Expr<'T>` where `'T` is the type of the wrapped function that we implemented using lambda function syntax. When we get a value of this type, we cannot execute the function (because it was compiled as a data structure, not as executable code). We can use the `Raw` property to get a value of the non-generic type `Expr`, which can be analyzed, translated to another language or processed in some other way. F# also provides an operator `<@@ ... @@>` which gives us the untyped *raw* quotation directly.
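To make the idea of "code as data" more concrete, here is a minimal sketch that quotes a trivial function and inspects the result using the active patterns from `Microsoft.FSharp.Quotations.Patterns`. The `describe` helper and the quoted lambda are made up for this illustration; they are not part of the Accelerator translator.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

// A typed quotation of a simple lambda function
let typed : Expr<int -> int> = <@ fun x -> x + 1 @>

// The Raw property gives the untyped representation
let raw : Expr = typed.Raw

// Hypothetical helper: recognizes a lambda whose body is
// a call to a static method with exactly two arguments
let describe (e: Expr) =
    match e with
    | Lambda (v, Call (None, m, [_; _])) ->
        sprintf "lambda over %s calling %s" v.Name m.Name
    | _ -> "something else"

printfn "%s" (describe raw)
```

A translator works in the same spirit, only with many more cases: it walks the tree and decides what to do for each recognized function or operator.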

Later in the article, we'll use quotations for translating F# code to code that's executed on the GPU using Accelerator. We'll take quoted code that contains some understood functions and operators (such as `select` and `.+`) and run some processing that gives us a function performing the same operation using the GPU. However, there is one more interesting thing when it comes to quotations. We can also take a quotation of a standard F# function when it is marked with the `ReflectedDefinition` attribute:

```
> [<ReflectedDefinition>]
  let blur input =
    let sum = (shift input -1) .+
              input .+ (shift input +1)
    sum /| 3.0f;;
val blur : Matrix<float32> -> Matrix<float32>
```

This time, we wrote a standard function named `blur`. As you can see, the inferred type signature is also that of a usual function. The interesting thing about the listing is the use of `ReflectedDefinition`. When the compiler sees this attribute, it compiles the function into executable code and *in addition* stores the quotation of the function. This means that when we later attempt to transform the code `<@@ blur @@>` using Accelerator, we'll be able to get the body of the `blur` function and translate it.
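The following sketch shows one way the stored quotation can be retrieved, using the `MethodWithReflectedDefinition` active pattern from `Microsoft.FSharp.Quotations.DerivedPatterns`. The `addOne` function and the `tryGetBody` helper are hypothetical names used only for this example; the actual translator may locate the body differently.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// An ordinary function that also stores its quotation
[<ReflectedDefinition>]
let addOne (x: int) = x + 1

// Hypothetical helper: skips over lambdas, finds a call to a function
// marked with ReflectedDefinition and returns the stored quotation
let rec tryGetBody (e: Expr) : Expr option =
    match e with
    | Lambda (_, inner) -> tryGetBody inner
    | Call (_, MethodWithReflectedDefinition body, _) -> Some body
    | _ -> None

// The function still runs as usual...
let result = addOne 41

// ...and its definition is also available as data
let body = tryGetBody <@@ addOne @@>
printfn "result = %d, quotation found = %b" result body.IsSome
```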

This is a very interesting feature, because we can write an ordinary function that calculates with the F# `Matrix` type. We can test and debug this function, because it is a standard executable function. When we know that the function works correctly, we can take its quotation and run it more efficiently using Accelerator.

## Data-parallel Matrix operations

I mentioned that we can write our calculations as standard F# programs using some well-known functions that are understood by a library that evaluates F# quotations using Accelerator. In this section, we'll discuss these *well-known functions*. These functions provide functionality similar to Accelerator, but are implemented as standard F# functions using the generic `Matrix` type. This means that we can test and debug the code easily in F#. In addition to these *data-parallel* operations, the translator also allows us to use a type representing a quadruple of floats. In the next section, we'll start with this type.

### Introducing the float4 type

The `float4` type is implemented in the files `Float4.fsi` and `Float4.fs`. The first file defines the public interface and the second one is an implementation file. The implementation follows the standard F# patterns for implementing a numeric type (so if you need to implement your own numeric type, this example could be a good starting point!). It starts by declaring the `Float4` type and a type alias `float4`. Then it defines a module `Float4` with various useful functions for working with values of the type. Finally, it implements an intrinsic type extension that adds overloaded operators and uses functions from the `Float4` module in the implementation. We'll introduce the type with a single example:

```
> #r "FSharp.PowerPack.dll";;
> #load "Float4.fs";;
> open System.Drawing
  open FSharp.Math;;
> let clr1 = float4(1.0f, 0.5f, 1.0f, 0.0f)
  let clr2 = Float4.ofColor Color.Magenta;;
val clr1 : Float4 = (1,0.5,1,0)
val clr2 : Float4 = (1,1,0,1)
> let sum = List.sum [clr1; clr2];;
val sum : Float4 = (2,1.5,1,1)
> sum / Float4.ofSingle(2.0f);;
val it : Float4 = (1,0.75,0.5,0.5)
```

We start by loading the implementation of the type. Note that you may need to specify the full path to the implementation file. Then we create two `float4` values. The first one is initialized using the `float4` function and the second one is created from a color using one of the conversion functions. Once we have two values, we sum them using `List.sum`. This is possible because we provided an implementation of the `INumeric` interface, so the F# runtime knows how to add values of our type. Finally, we divide the sum by 2.0f to get an average value. As you can see, `float4` is perfect for representing images and we'll use it exactly for this purpose shortly.
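If you'd like to experiment without the downloadable `Float4.fs`, the following is a much simplified stand-in. The `Vec4` type, its field names and the `OfSingle` member are hypothetical and don't mirror the real implementation. In this sketch, `List.sum` works because the type exposes static `Zero` and `+` members, which is what `List.sum` requires in current versions of F#.

```fsharp
// A simplified four-component numeric type
// (hypothetical; not the Float4 implementation from the download)
type Vec4 =
    { A: float32; B: float32; C: float32; D: float32 }
    // Required by List.sum together with the (+) operator
    static member Zero = { A = 0.0f; B = 0.0f; C = 0.0f; D = 0.0f }
    // Component-wise addition
    static member (+) (l: Vec4, r: Vec4) =
        { A = l.A + r.A; B = l.B + r.B; C = l.C + r.C; D = l.D + r.D }
    // Component-wise division
    static member (/) (l: Vec4, r: Vec4) =
        { A = l.A / r.A; B = l.B / r.B; C = l.C / r.C; D = l.D / r.D }
    // Creates a value with all four components set to the same number
    static member OfSingle(v: float32) = { A = v; B = v; C = v; D = v }

let c1 = { A = 1.0f; B = 0.5f; C = 1.0f; D = 0.0f }
let c2 = { A = 1.0f; B = 1.0f; C = 0.0f; D = 1.0f }

// Sum the values and divide by two to get the average
let total = List.sum [c1; c2]
let average = total / Vec4.OfSingle(2.0f)
printfn "%A" average
```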

### Using F# Matrix type

As I already mentioned, we're going to implement calculations using the F# `Matrix` type. This type is available in the F# PowerPack in the `Microsoft.FSharp.Math` namespace. We're going to use a generic version of the type (the non-generic one simply stores values of type `float`). We can work with it using numerous higher-order functions provided by the F# library:

```
> open Microsoft.FSharp.Math
  module Matrix = Matrix.Generic;;
> let m1 = Matrix.init 4 4 (fun y x -> float32(x*10 + y));;
val m1 : Matrix<float32> =
  matrix [[0.0f; 10.0f; 20.0f; 30.0f]; [1.0f; 11.0f; 21.0f; 31.0f]
          [2.0f; 12.0f; 22.0f; 32.0f]; [3.0f; 13.0f; 23.0f; 33.0f]]
> let m2 = m1 |> Matrix.map (fun v -> sqrt(v));;
val m2 : Matrix<float32> =
  matrix [[0.0f; 3.1f; 4.4f; 5.4f]; [1.0f; 3.3f; 4.5f; 5.5f]
          [1.4f; 3.4f; 4.6f; 5.6f]; [1.7f; 3.6f; 4.7f; 5.7f]]
```

We started by opening the namespace with various mathematical types and modules and by creating an alias for a module, which contains functions for working with generic matrices. Then we use `Matrix.init` to initialize a matrix that contains floating-point numbers representing X and Y coordinates. The function provided as an argument is called for each element to calculate the initial element value. In the next step, we use the `Matrix.map` function to calculate the square root of every element in the matrix.

This example isn't particularly complicated or interesting. However, we can use it to demonstrate a different approach to encoding matrix calculations. So far, we used operations that calculate with individual elements of the matrix (the `v` value inside `Matrix.map`) or with the individual coordinates (`x` and `y` in `Matrix.init`). This is the usual approach, but it can contain very complicated processing logic with coordinates or values as inputs. This would make translating the code to GPU code difficult, because the constructs we can use on GPUs are in many ways limited. In the next section, we'll look at another way of writing operations with matrices, which is more suitable for automatic translation to GPU code.

### Data-parallel matrix operations

In this section, we'll look at the functions from the `DataParallel` module in the `FSharp.Math` namespace. This is functionality that I implemented (you'll find a download link at the end of the article). It mostly just re-implements the operations that are available in Accelerator, but for the standard F# `Matrix` type, so that we can write standard F# code using data-parallel operations.

The common aspect of all the operations is that they never explicitly calculate with coordinates or individual values stored in the matrix. They perform the same operation on all elements of the matrix, which is the key aspect that makes translation to GPU code (in our case, implemented using Accelerator) possible. Let's first look at how to implement the same functionality as in the previous listing, using operations from the `DataParallel` module:

```
> #load @"DataParallel.fs";;
> open FSharp.Math
  open FSharp.Math.DataParallel;;
> let posX = positions 4 4 1
  let posY = positions 4 4 0;;
val posX : Matrix<int> =
  matrix [[0; 0; 0; 0]; [1; 1; 1; 1]
          [2; 2; 2; 2]; [3; 3; 3; 3]]
val posY : Matrix<int> =
  matrix [[0; 1; 2; 3]; [0; 1; 2; 3]
          [0; 1; 2; 3]; [0; 1; 2; 3]]
> let m1Ints = posX *| 10 .+ posY
  let m1 = Conversions.singleOfInt m1Ints;;
val m1 : Matrix<float32> =
  matrix [[0.0f; 10.0f; 20.0f; 30.0f]; [1.0f; 11.0f; 21.0f; 31.0f]
          [2.0f; 12.0f; 22.0f; 32.0f]; [3.0f; 13.0f; 23.0f; 33.0f]]
> let m2 = pointwiseSqrt m1;;
val m2 : Matrix<float32> =
  matrix [[0.0f; 3.1f; 4.4f; 5.4f]; [1.0f; 3.3f; 4.5f; 5.5f]
          [1.4f; 3.4f; 4.6f; 5.6f]; [1.7f; 3.6f; 4.7f; 5.7f]]
```

The listing starts by loading the file with the implementation (later we'll use it in a project, so you can reference the library or include the file in your project) and by opening the necessary namespaces. Then we start creating the matrix with numbers corresponding to coordinates.

As the first step, we use the `positions` function. The first two arguments specify the required dimensions (4x4 in our case). The function initializes a new matrix of this size filled with X or Y coordinates (represented as integers) depending on the last argument. As you can see, we called it twice. In the first command, we initialize values with X coordinates; note that all column vectors of the matrix are the same. In the second case, we initialize the matrix with Y coordinates and all the row vectors are the same.

Now we have matrices to start with, so we can create more complicated matrices by performing point-wise operations with the initial ones. We start by multiplying values of `posX` by a scalar value using the `*|` operator (there are similar operators, such as `+|` for adding a scalar). Then we add the result to the `posY` matrix using point-wise addition `.+` (again, there are similar operators such as `./` and `.*`).

On the next line, we convert the type from `Matrix<int>` to `Matrix<float32>` using a function from the `DataParallel.Conversions` module (you can find other conversion functions in the source code). Now we get exactly the same result as in the earlier example that used `Matrix.init`. As the last step, we use the `pointwiseSqrt` function, which calculates the square root of each element in the matrix.

As you can see, getting the first matrix was a bit more complicated, because we have a limited set of matrices to start with. We used the `positions` function to get matrices with coordinates as values. Other useful starting points are constants. The library offers numerous constant matrices such as `zerof` (containing 0.0f) and others, or you can use the `matrixConstant` function. On the other hand, calculating the square root is a bit simpler, because we have a point-wise function that operates over matrices. However, the most notable thing about the code is that we never needed to explicitly work with individual elements or with coordinates. This raises the level of abstraction and hides the implementation of operations (which makes it possible to accelerate the code using the GPU).
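To give a rough idea of how such operators can be defined in plain F#, here is a minimal sketch that works with ordinary 2D arrays instead of the PowerPack `Matrix` type. The `pointwise` helper and the `rowIndex` and `colIndex` arrays are hypothetical names, and the two operators are defined only for integers; the `DataParallel` module in the download is more general than this.

```fsharp
// Hypothetical stand-ins over plain 2D arrays; the real DataParallel
// module works with the PowerPack Matrix type instead.

// Applies a binary function to corresponding elements of two arrays
let pointwise f (a: 'T[,]) (b: 'T[,]) =
    Array2D.init (Array2D.length1 a) (Array2D.length2 a)
        (fun r c -> f a.[r, c] b.[r, c])

// Point-wise addition of two integer arrays
let ( .+ ) (a: int[,]) (b: int[,]) = pointwise (+) a b

// Multiplication of every element by a scalar
// (the spaces inside the parentheses keep "(*" from starting a comment)
let ( *| ) (a: int[,]) (k: int) = Array2D.map (fun v -> v * k) a

// Arrays holding the row index and the column index of each element
let rowIndex = Array2D.init 2 3 (fun r _ -> r)
let colIndex = Array2D.init 2 3 (fun _ c -> c)

// Whole-array expression; no element is accessed individually here
let combined = colIndex *| 10 .+ rowIndex
printfn "%A" combined
```

The expression on the last lines needs no parentheses, because an operator starting with `*` binds more tightly than one whose first character after the leading dot is `+`.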

## Implementing data-parallel rotation

Now that we've seen how to write some basic operations with matrices using data-parallel functions, let's look at a more interesting example. In this section, we'll implement a simple version of rotation of an image (the rotated photo of Prague in the introduction was generated by this algorithm). We'll also explore some more functions from the `DataParallel` module. In the next section, we'll look at how to run the function on the GPU (which is also the reason why the declaration is marked with the `ReflectedDefinition` attribute).

```
[<ReflectedDefinition>]
let rotateImage s c w h whalf hhalf (data:Matrix<float4>) =
  // Initialize arrays with X and Y coordinates of a bitmap
  // and convert numbers to mid-image coordinates
  let posX = (Conversions.singleOfInt (positions w h 0)) -| whalf
  let posY = (Conversions.singleOfInt (positions w h 1)) -| hhalf

  // Calculate rotated coordinates
  let rotatedX = posX *| c .- posY *| s
  let rotatedY = posY *| c .+ posX *| s

  // Convert back to corner coordinates.
  let posX = rotatedX +| whalf
  let posY = rotatedY +| hhalf

  // Calculate X and Y indices and get values at indices
  // Note: We'll implement a version with smoothing later!
  let indX = Conversions.intOfSingle posX
  let indY = Conversions.intOfSingle posY
  gather data indX indY
```

We start with the matrices created by the `positions` function, but we immediately subtract half of the width or height respectively. This way we get coordinates relative to the center of the image. The values `whalf` and `hhalf` are passed as parameters to the function, because they will be calculated in advance on the CPU. Inside the `rotateImage` function, we perform only data-parallel operations on matrices, so that it can be executed on Accelerator using quotations.

Once we have the matrices with coordinates (`posX` and `posY`), we rotate the coordinates. The result will be two matrices (`rotatedX` and `rotatedY`) of the same size. These matrices contain the rotated coordinates. This means that if we have some `x` and `y`, we can get the coordinates `x', y' = rotatedX.[x, y], rotatedY.[x, y]`. If we then perform a lookup into the original image `data.[x', y']`, we'll get the pixel at the specified `x, y` location after the rotation.

To calculate the rotated coordinates, we use the sine and cosine of the required angle. These two values are also calculated in advance on the CPU and are passed as parameters (`c` and `s`) to our function. We calculate the coordinates using point-wise addition and subtraction and using multiplication by a scalar value. After calculating the rotated values, we also convert the coordinates back to the corner coordinates, so that we can perform the lookup in the source image (`data`).
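The same arithmetic can be written for a single point, which may make the matrix version easier to follow. This is only a sketch of the formula used in `rotateImage`; the `rotatePoint` function is a hypothetical helper that doesn't exist in the library.

```fsharp
// Rotates a single point around the center (whalf, hhalf) using
// pre-computed sine (s) and cosine (c), mirroring the matrix version
let rotatePoint (s: float32) (c: float32) (whalf: float32) (hhalf: float32)
                (x: float32, y: float32) =
    // Convert to mid-image coordinates
    let cx, cy = x - whalf, y - hhalf
    // Calculate rotated coordinates
    let rx = cx * c - cy * s
    let ry = cy * c + cx * s
    // Convert back to corner coordinates
    (rx + whalf, ry + hhalf)

// Rotation by 90 degrees (sin = 1, cos = 0) around the point (1, 1)
let rotated = rotatePoint 1.0f 0.0f 1.0f 1.0f (2.0f, 1.0f)
printfn "%A" rotated
```

In `rotateImage`, exactly this calculation is performed for all positions at once, because `posX` and `posY` hold the coordinates of every pixel.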

The last block of code in the function implements the lookup. We first convert the rotated coordinates to exact locations (in integers). Then we use the `gather` function, which is a data-parallel version of the lookup operation. It takes a source matrix (`data`) and a pair of matrices with X and Y coordinates (`indX` and `indY`). For each position in these two matrices, it finds the value at the specified location in the source matrix and returns it as an element of the result. If we wanted to write this operation in F#, we'd calculate the result for each element using the following equation: `res.[y, x] = data.[indY.[y, x], indX.[y, x]]`.
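The equation above translates almost directly into code over plain 2D arrays. The following sketch is a hypothetical `gatherSketch` function, not the library's `gather`; in particular, clamping out-of-range indices to the nearest valid position is my assumption here, and the real function may treat such indices differently.

```fsharp
// Data-parallel lookup over plain 2D arrays (hypothetical stand-in)
let gatherSketch (data: 'T[,]) (indX: int[,]) (indY: int[,]) =
    let rows, cols = Array2D.length1 data, Array2D.length2 data
    // Assumption: indices outside of the source are clamped to its border
    let clamp limit i = max 0 (min (limit - 1) i)
    Array2D.init (Array2D.length1 indX) (Array2D.length2 indX) (fun y x ->
        let row = clamp rows (indY.[y, x])
        let col = clamp cols (indX.[y, x])
        data.[row, col])

// Index matrices that flip a 2x2 source horizontally
let source = array2D [ [ 1; 2 ]; [ 3; 4 ] ]
let flipX = array2D [ [ 1; 0 ]; [ 1; 0 ] ]
let sameY = array2D [ [ 0; 0 ]; [ 1; 1 ] ]
let flipped = gatherSketch source flipX sameY
printfn "%A" flipped
```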

Understanding the code for the rotation isn't easy the first time. You can also find some useful information in the documentation for Accelerator, which explains operations like `gather` in more detail. However, once you understand the code, you'll be surprised by its elegance.

### Running rotation on CPU

Now we have the core part of the rotation algorithm. We wrote a function that takes some pre-computed parameters and an input image represented as a matrix of type `Matrix<float4>` and returns a rotated image of the same matrix type. We start by calling this function from ordinary F# code running on the CPU and later we'll look at how to run this function using Accelerator.

We'll first load the input bitmap and implement a utility higher-order function that will be useful for both CPU and GPU versions. It takes an actual function that performs the rotation as the first parameter. The second parameter and the return type are the same and contain the rotated image and the desired angle (`Matrix<float4> * float`).

```
let bmp = Bitmap.FromFile(Application.StartupPath+"\\prague.jpg") :?> Bitmap
let data = bmp |> Conversions.matrixOfBitmap Float4.ofColor

/// Performs pre-computation of sin and cos values and then
/// invokes the rotation function passed as the first parameter
let rotationStep invokeRotation (_, angle) =
  let angle = if (angle >= 359.0) then 0.0 else (angle + 1.0)
  let angleRad = Math.PI / 180.0 * angle
  let (s, c) = float32(Math.Sin(angleRad)), float32(Math.Cos(angleRad))
  let (w, h) = (bmp.Width, bmp.Height)
  (invokeRotation s c w h (w / 2) (h / 2), angle)
```

We start by loading a bitmap from the application startup folder and then convert it to a matrix using one of the utility functions from the `Conversions` module. This conversion uses a specified function to convert individual pixels to the target type of matrix elements, so we use `Float4.ofColor` to convert colors to values of type `float4`.

The `rotationStep` function first calculates a new angle (incremented by 1). Then it converts the angle to radians and pre-computes values that are passed to the actual rotation function (sine and cosine of the angle, half of the width and height of the bitmap). Once we have all the pre-computed values, we invoke the specified rotation function and return the result together with the new angle value as the new state.

The last step that we need to do now is to call `rotationStep` with the first parameter specifying the rotation function running on the CPU. If we specify only the first parameter, we'll get a function that calculates a new state (bitmap and angle) from the previous step (bitmap and angle). We'll use a form similar to the one we used when showing the Game of Life to run the function iteratively and display the result after each step:

```
let run = rotationStep (fun sn cs wid hgt widhalf hgthalf ->
  rotateImage sn cs wid hgt widhalf hgthalf data)
let toBitmap = fst >> Conversions.bitmapOfMatrix Float4.toColor
DrawingForm.Run((data, 0.0), run, toBitmap)
```

The first parameter of the form is an initial state. The second parameter is a function that calculates a new state from the previous one. Finally, the last parameter is a function that converts the current state (`Matrix<float4> * float`) to a bitmap that can be drawn on the form. We create this function using function composition. We first take the first element of the tuple (a matrix) and then convert it to a bitmap using the `bitmapOfMatrix` function.
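The pattern of partially applying a step function and composing a display function with `fst` can be tried in isolation. In the following sketch, the "image" is just a string and all names (`step`, `runStep`, `display`) are hypothetical; it only mimics the shape of `rotationStep`, `run` and `toBitmap`, without any form or bitmap.

```fsharp
// Hypothetical state-transition function: the "image" is just a string here
let step (render: float -> string) (_: string, angle: float) =
    let angle = if angle >= 359.0 then 0.0 else angle + 1.0
    (render angle, angle)

// Specifying only the first parameter gives a function from state to state
let runStep = step (fun a -> sprintf "frame at %.0f degrees" a)

// Function composition: take the first element of the state, then convert it
let display : string * float -> string =
    fst >> (fun (s: string) -> s.ToUpper())

// Run three steps, starting just before the angle wraps around to zero
let final = List.fold (fun state _ -> runStep state) ("start", 358.0) [ 1; 2; 3 ]
printfn "%s" (display final)
```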

## Accelerating data-parallel F#

You're probably reading this article to learn how to run computations more efficiently on GPU or multi-core CPU, but we're almost at the end of the article and I didn't write a single word on this topic! Don't be disappointed. Actually, all we did in the article so far was a preparation that makes running the code using Accelerator surprisingly easy!

We wrote the `rotateImage` function using only functions from the `DataParallel` module and we marked the code using the `ReflectedDefinition` attribute. Now we can use a translator I implemented (which is available as a source code download below). The translator evaluates quoted data-parallel code using Accelerator, so we only need to invoke it:

```
open FSharp.Accelerator
open FSharp.Accelerator.EvalTransformation

let target = new Microsoft.ParallelArrays.DX9Target()
let processed = Accelerate.accelerate <@@ rotateImage @@>
let runAcc =
  rotationStep (fun sn cs wid hgt widhalf hgthalf ->
    eval<Matrix<float4>> target processed
      [ makeValue sn; makeValue cs; makeValue wid; makeValue hgt;
        makeValue widhalf; makeValue hgthalf; makeValue data ])

let toBitmap = fst >> Conversions.bitmapOfMatrix Float4.toColor
DrawingForm.Run((data, 0.0), runAcc, toBitmap)
```

The code is slightly more complicated than the version running on the CPU, but it is still very simple. We first initialize an Accelerator target (in this example, we'll run the code on the GPU using the `DX9Target`). Then we run the `accelerate` function, which is a part of the translator. It returns a new function which runs the function enclosed in quotations using Accelerator. As I already mentioned, thanks to the `ReflectedDefinition` attribute, the translation function can also look inside the body of `rotateImage`.

Once we have the accelerated function, we can use it inside the lambda function specified as a parameter of `rotationStep`. We cannot invoke the `processed` function directly, because all the parameters and the result are encoded in some way. Instead, we use the generic `eval` function, which takes the desired type of result as a type parameter (in our case, we specify that it should return a value of type `Matrix<float4>`). The first parameter is the Accelerator target, the second parameter is the accelerated function to run and finally, the last parameter is a list of arguments to the accelerated function. We also need to turn all the parameters into a special value that the `eval` function expects, which is done using the `makeValue` function.

## Summary

In this article, we've seen how to work with the F# `Matrix` type using data-parallel functions that duplicate the functionality available in Accelerator, but work with standard F# data types. The implementation of all the functions from the `DataParallel` module is a part of the project that translates data-parallel F# code to Accelerator using F# quotations and you can get it from the list of downloads below. We've also seen a more complicated data-parallel calculation when we looked at the implementation of an image rotation.

When we write a function using only primitives understood by the translator and when we mark the function using `ReflectedDefinition`, the translator can evaluate the function more efficiently using Accelerator (either using the GPU or a multi-core CPU). However, we can still run the code as standard F# code, which is very useful when testing and debugging.

In this article, we looked only at the simplest implementation of rotation. When we used `gather` to find the location of a rotated pixel, we always used just the nearest neighbor. We can get better results by collecting several nearby pixels and interpolating the color depending on the fraction (because the desired location may be, for example, [121.3f, 35.8f]). This better version is implemented in the downloadable source code using the `interpolatef4` function, which performs a linear interpolation between `float4` values.

Finally, in this article we used the `EvalTransformation` module to translate the code from quotations to Accelerator. We didn't discuss the translation architecture in detail (we'll do that in the next article!). However, this module processes quotations and evaluates the code using Accelerator at the same time. This makes the source code easier to understand (the downloadable source code includes only this module, so it should be fairly readable). However, a more efficient approach is to process quotations once and build a function that can be executed later. When we want to rotate the image, we just invoke the function and we don't have to analyze the code again. We'll look at this more efficient approach in the next article.
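To illustrate the difference on a toy scale, the following sketch analyzes a quotation once and returns an ordinary closure that can be called repeatedly without looking at the quotation again. It is not how `EvalTransformation` or the translator from the next article works; the `compileBody` and `compile` functions are hypothetical and understand only a single `float32` argument, constants, addition and multiplication.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// Translates the body of a one-argument lambda into a closure.
// Assumption: every variable in the body is the lambda argument.
let rec compileBody (e: Expr) : float32 -> float32 =
    match e with
    | Var _ -> id
    | Value (v, _) ->
        let k = unbox<float32> v
        fun _ -> k
    | SpecificCall <@@ (+) @@> (_, _, [a; b]) ->
        let fa, fb = compileBody a, compileBody b
        fun x -> fa x + fb x
    | SpecificCall <@@ ( * ) @@> (_, _, [a; b]) ->
        let fa, fb = compileBody a, compileBody b
        fun x -> fa x * fb x
    | _ -> failwithf "unsupported expression: %A" e

// Analyzes the quotation once and returns a reusable function
let compile (q: Expr<float32 -> float32>) =
    match q.Raw with
    | Lambda (_, body) -> compileBody body
    | _ -> failwith "expected a lambda function"

// The pattern matching happens here, only once...
let scaled = compile <@ fun x -> x * 2.0f + 1.0f @>

// ...and these calls just run the resulting closures
printfn "%f %f" (scaled 3.0f) (scaled 10.0f)
```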

## Downloads and References

- Download the source code (ZIP, 1.09MB)

- [1] Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses - Microsoft Research
- [2] Accelerator Project Homepage - Microsoft Research
- [3] Microsoft Research Accelerator v2 - Microsoft Connect
- [4] GPGPU and x64 Multicore Programming with Accelerator from F# - Satnam Singh's blog at MSDN
