Accelerator and F# (III.): Data-parallel programs using F# quotations

If you've been following this article series, you already know that Accelerator is a MSR library [1, 2] that allows you to run code in parallel on either multi-core CPU or using shaders on GPU (see introduction). We also discussed a direct way to use Accelerator from F# (by calling Accelerator methods directly) and implemented Conway's Game of Life. In this article, we'll look at more sophisticated way of using Accelerator from F#. We'll introduce F# quotations and look at translating 'normal' F# code to use Accelerator.

In general, F# quotations allow us to treat F# code as data structure and manipulate with it. This is very similar to C# expression trees, but the F# implementation is more powerful. We can also mark a standard method or a function with a special attribute that tells the compiler to store quotation of the body. Then we can access the quotation and traverse it or modify it. In this article we'll use a function that takes an F# quotation (containing a limited set of functions) and executes it using MSR Accelerator. Implementing this functionality is a bit complicated, so we won't discuss the implementation now. We'll leave this for some future article of this series. In future, we'll also look at other interesting possibilities that we have when writing code using quotations. Here is a list of articles in this series and of the articles that I'm planning to add:

Accelerator and F# (I.): Introduction and calculating PI
Accelerator and F# (II.): The Game of Life on GPU
Accelerator and F# (III.): Data-parallel programs using F# quotations
Accelerator and F# (IV.): Composing computations with quotations

Processing code with quotations

Let's start by looking at F# quotations briefly. When you use expression trees in C#, the compiler decides whether a lambda expression should be compiled as a delegate or as an expression tree depending on the target type. F# uses a different approach - when we want to compile code as quotations, we mark it explicitly. The following example demonstrates a quoted lambda function that implements blurring of F# Matrix values (only in the X coordinate to make the code simpler):

> <@@ fun input ->
    let sum = (shift input -1 0) .+
     input .+ (shift input +1 0) 
    sum /| 3.0f
  @@>;;
val it : Expr<Matrix<float32> -> Matrix<float32>>

You'll understand the implementation of the function in full details after reading the article. Briefly, the shift function moves values in the matrix by specified offset (corresponding to Shift in Accelerator). The .+ operator performs point-wise addition of two matrices and finally, the /| operator is a point-wise division by a scalar value.

However, we're looking at the example to understand F# quotations. As you can see, the entire function is wrapped inside the <@@ ... @@> operator. This tells the compiler to compile the body as a quotation, which is a data structure representing the code. This is also reflected in the type of the result. The type is inferred as Expr<'T> where 'T is the type of the wrapped function that we implemented using lambda function syntax. When we get a value of this type, we cannot execute the function (because it was compiled as a data structure, not as an executable code). We can use the Raw property to get a value of non-generic type Expr, which can be analyzed, translated to other language or processed in some other way. F# also provides an operator <@@@@ ... @@@@> which gives us the untyped raw quotation directly.

Later in the article, we'll use quotations for translating F# code to code that's executed on GPU using Accelerator. We'll take a quoted code that contains some understood functions and operators (such as select and .+) and run some processing that gives us a function performing the same operation using GPU. However, there is one more interesting thing when it comes to quotations. We can also take a quotation of a standard F# function when it is marked with the ReflectedDefinition attribute:

> [<ReflectedDefinition>]
  let blur input =
    let sum = (shift input -1) .+
      input .+ (shift input +1) 
    sum /| 3.0f;;
val blur : Matrix<float32> -> Matrix<float32>

This time, we wrote a standard function named blur. As you can see, the inferred type signature is also a usual function. The interesting thing about the listing is the use of ReflectedDefinition. When the compiler sees this attribute, it compiles the function into an executable code and in addition stores the quotation of the function. This means that when we later attempt to transform code <@@@@ blur @@@@> using Accelerator, we'll be able to get the body of the blur function and translate it.

This is a very interesting feature, because we can write an ordinary function that calculates with the F# Matrix type. We can test and debug this function, because it is standard executable function. When we know that the function works correctly, we can take its quotation and run it more efficiently using Accelerator.

Data-parallel Matrix operations

I mentioned that we can write our calculations as standard F# programs using some well-known functions that are understood by a library that evaluates F# quotations using Accelerator. In this section, we'll discuss these well-known functions. These functions implement similar functionality as Accelerator, but are implemented as standard F# functions using the generic Matrix type. This means that we can test and debug the code easily in F#. In addition to these data-parallel operations, the translator also allows us to use a type representing quadruple of floats. In the next section, we'll start with this type.

Introducing the float4 type

The float4 type is implemented in the files Float4.fsi and Float4.fs. The first file defines the public interface and the second one is an implementation file. The implementation follows the standard F# patterns for implementing a numeric type (so if you'll need to implement your own numeric type, this example could be a good starting point!) It starts by declaring the Float4 type and a type alias float4. Then it defines a module Float4 with various useful functions for working with values of the type. Finally, it implements an intrinsic type extension that adds overloaded operators and uses functions from the Float4 module in the implementation. We'll introduce the type with a single example:

> #r "FSharp.PowerPack.dll";;
> #load "Float4.fs";;
> open System.Drawing
  open FSharp.Math;;

> let clr1 = float4(1.0f, 0.5f, 1.0f, 0.0f)
  let clr2 = Float4.ofColor Color.Magenta;;
val clr1 : Float4 = (1,0.5,1,0)
val clr2 : Float4 = (1,1,0,1)

> let sum = List.sum [clr1; clr2];;
val sum : Float4 = (2,1.5,1,1)
 
> sum / Float4.ofSingle(2.0f);;
val it : Float4 = (1,0.75,0.5,0.5)

We start by loading the implementation of the type. Note that you may need to specify the full path to the implementation file. Then we create two float4 values. The first one is initialized using float4 function and the second one is created from a color using one of the conversion functions. Once we have two values, we sum them using List.sum. This is possible, because we provided an implementation of INumeric interface, so the F# runtime knows how to add values of our type. Finally, we divide the sum by 2.0f to get an average value. As you can see, float4 is perfect for representing images and we'll use it exactly for this purpose shortly.

Using F# Matrix type

As I already mentioned, we're going to implement calculations using the F# Matrix type. This type is available in the F# PowerPack in the Microsoft.FSharp.Math namespace. We're going to use a generic version of the type (the non-generic one simply stores values of type float). We can work with it using numerous higher-order functions provided by the F# library:

> open Microsoft.FSharp.Math
  module Matrix = Matrix.Generic;;

> let m1 = Matrix.init 4 4 (fun y x -> float32(x*10 + y));;
val m1 : Matrix<float32> = 
  matrix [[0.0f; 10.0f; 20.0f; 30.0f]; [1.0f; 11.0f; 21.0f; 31.0f]
          [2.0f; 12.0f; 22.0f; 32.0f]; [3.0f; 13.0f; 23.0f; 33.0f]]

> let m2 = m1 |> Matrix.map (fun v -> sqrt(v));;
val m2 : Matrix<float32> =
  matrix [[0.0f; 3.1f; 4.4f; 5.4f]; [1.0f; 3.3f; 4.5f; 5.5f]
          [1.4f; 3.4f; 4.6f; 5.6f]; [1.7f; 3.6f; 4.7f; 5.7f]]

We started by opening the namespace with various mathematical types and modules and by creating an alias for a module, which contains functions for working with generic matrices. Then we use Matrix.init to initialize a matrix that contains floating-point number representing X and Y coordinates. The function provided as an argument is called for each element to calculate the initial element value. In the next step, we use the Matrix.map function to calculate square root of every element in the matrix.

This example isn't particularly complicated or interesting. However, we can use it to demonstrate different approach for encoding matrix calculations. So far, we often used operations that calculate with individual elements of the matrix (the v value inside Matrix.map) or with the individual coordinates (x and y in Matrix.init). This is the usual approach, but it can contain very complicated processing logic with coordinates or values as inputs. This would make translating the code to GPU code difficult, because the constructs we can use on GPUs are in many ways limited. In the next section, we'll look at another way of writing operations with matrices, which is more suitable for automatic translation to GPU code.

Data-parallel matrix operations

In this section, we'll look at the functions from the DataParallel module in FSharp.Math namespace. This is a functionality I implemented (and you'll find download link at the end of the article). It mostly just re-implements the operations that are available in Accelerator, but for the standard F# Matrix type, so that we can write standard F# code using data-parallel operations.

The general aspect of all the operations is that they never explicitly calculate with coordinates or individual values stored in the matrix. They perform the same operation on all elements of the matrix, which is the key aspect that makes translation to GPU code (in our case, implemented using Accelerator) possible. Let's first look how to implement the same functionality as in previous listing, using operations from the DataParallel module:

> #load @@"DataParallel.fs";;
> open FSharp.Math
  open FSharp.Math.DataParallel;;

> let posX = positions 4 4 1
  let posY = positions 4 4 0;;
val posX : Matrix<int> = 
  matrix [[0; 0; 0; 0]; [1; 1; 1; 1]
          [2; 2; 2; 2]; [3; 3; 3; 3]]
val posY : Matrix<int> = 
  matrix [[0; 1; 2; 3]; [0; 1; 2; 3]
          [0; 1; 2; 3]; [0; 1; 2; 3]]

> let m1Ints = posX *| 10 .+ posY
  let m1 = Conversions.singleOfInt m1Ints;;
val m1 : Matrix<float32> = 
  matrix [[0.0f; 10.0f; 20.0f; 30.0f]; [1.0f; 11.0f; 21.0f; 31.0f]
          [2.0f; 12.0f; 22.0f; 32.0f]; [3.0f; 13.0f; 23.0f; 33.0f]]

> let m2 = pointwiseSqrt m1;;
val m2 : Matrix<float32> = 
  matrix [[0.0f; 3.1f; 4.4f; 5.4f]; [1.0f; 3.3f; 4.5f; 5.5f]
          [1.4f; 3.4f; 4.6f; 5.6f]; [1.7f; 3.6f; 4.7f; 5.7f]]

The listing starts by loading the file with the implementation (later we'll use it in a project, so you can reference the library or include the file in your project) and by opening the necessary namespaces. Then we start creating the matrix with numbers corresponding to coordinates.

As the first step, we use the positions function. First two arguments specify the required dimensions (4x4 in our case). The function initializes a new matrix of this size filled with X or Y coordinates (represented as integers) depending on the last argument. As you can see, we called it twice. In the first command, we initialize values with X coordinates - note that all column vectors of the matrix all the same. In the second case, we initialize matrix with Y coordinates and all the row vectors are the same.

Now we have matrices to start with, so we can create more complicated matrices by performing point-wise operations with the initial ones. We start by multiplying values of posX by a scalar value using operator *| (there are similar operators for adding scalar +| and others). Then we add the result with the posY matrix using point-wise addition .+ (again, there are similar operators such as ./ and .*).

On the next line, we convert the type from Matrix<int> to Matrix<float32> using a function from the DataParallel.Conversions module (you can find other conversion functions in the source code). Now we get exactly the same result as in the earlier example that used Matrix.init. As the last step, we use the pointwiseSqrt function, which calculates square root of each element in the matrix.

As you can see, getting the first matrix was a bit more complicated, because we have a limited set of matrices to start with. We used the positions function to get matrices with coordinates as values. Other useful starting points are constants. The library offers numerous constant matrices such as zerof (containing 0.0f) and others, or you can use the matrixConstant function. On the other hand, calculating square root is a bit simpler, because we have a point-wise function that operates over matrices. However, the most notable thing about the code is that we never needed to explicitly work with individual elements or with coordinates. This raises the level of abstraction and hides the implementation of operations (which makes it possible to accelerate the code using GPU).

Implementing data-parallel rotation

Now that we've seen how to write some basic operations with matrices using data-parallel functions, let's look at a more interesting example. In this section, we'll implement a simple version of rotation of an image (the rotated photo of Prague in the introduction was generated by this algorithm). We'll also explore some more functions from the DataParallel module. In the next section, we'll look how to run the function on GPU (which is also a reason why the declaration is marked with the ReflectedDefinition attribute).

[<ReflectedDefinition>]
let rotateImage s c w h whalf hhalf (data:Matrix<float4>) =
  // Initialize arrays with X and Y coordinates of a bitmap 
  // and convert numbers to mid-image coordinates
  let posX = (Conversions.singleOfInt (positions w h 0)) -| whalf
  let posY = (Conversions.singleOfInt (positions w h 1)) -| hhalf
    
  // Calculate rotated coordinates 
  let rotatedX = posX *| c .- posY *| s
  let rotatedY = posY *| c .+ posX *| s
  
  // Convert back to corner coordinates.
  let posX = rotatedX +| whalf
  let posY = rotatedY +| hhalf

  // Calculate X and Y indices and get values at indices  
  // Note: We'll implement version wiht smoothing later!
  let indX = Conversions.intOfSingle posX
  let indY = Conversions.intOfSingle posY
  gather data indX indY

We start with the matrices created by the positions function, but we immediately subtract a half of the width or height respectively. This way we get coordinates relatively to the center of the image. The values whalf and hhalf are passed as parameters to the function, because they will be calculated in advance on the CPU. Inside the rotateImage function, we perform only data-parallel operations on matrices, so that it can be executed on Accelerator using quotations.

Once we have the matrices with coordinates (posX and posY), we rotate the coordinates. The result will be two matrices (rotatedX and rotatedY) of the same size. These matrices contain the rotated coordinates. This means that if we have some x and y, we can get the coordinates x', y' = rotatedX.[x, y], rotatedY.[x, y]. If we then perform a lookup into the original image data.[x', y'], we'll get the pixel at the specified x, y location after the rotation.

To calculate the rotated coordinates, we use sine and cosine of the required angle. These two values are also calculated in advance on the CPU and are passed as parameters (c and s) to our function. We calculate the coordinates using point-wise addition and subtraction and using the multiplication by scalar value. After calculating the rotated values, we also convert the coordinates back to the corner coordinates, so that we can perform the lookup in the source image (data).

The last block of code in the function implements the lookup. We first convert the rotated coordinates to exact locations (in integers). Then we use a gather function, which is a data-parallel version of lookup operation. It takes a source matrix (data) and a pair of matrices with X and Y coordinates (indX and indY). For each position in these two matrices, it finds the value at specified location in the source matrix and returns it as an element of the result. If we wanted to write this operation in F#, we'd calculate the result for each element using the following equation: res.[y, x] = data.[indY.[y, x], indX.[y, x]].

Understanding the code for the rotation isn't easy for the first time. You can also find some useful information in the documentation for Accelerator, which explains the operations like gather in more details. However, once you understand the code, you'll be surprised by its elegance.

Running rotation on CPU

Now we have the core part of the rotation algorithm. We wrote a function that takes some pre-computed parameters and an input image represented as a matrix of type Matrix<float4> and returns a rotated image of the same matrix type. We start by calling this function from an ordinary F# code running on CPU and later we'll look how to run this function using Accelerator.

We'll first load the input bitmap and implement a utility higher-order function that will be useful for both CPU and GPU versions. It takes an actual function that performs the rotation as the first parameter. The second parameter and the return type are the same and contain the rotated image and the desired angle (Matrix<float4> * float).

let bmp = Bitmap.FromFile(Application.StartupPath+"\\prague.jpg") :?> Bitmap   
let data = bmp |> Conversions.matrixOfBitmap Float4.ofColor

/// Performs pre-computation of sin and cos values and then
/// invokes the rotation function passed as the first parameter
let rotationStep invokeRotation (_, angle) =    
  let angle = if (angle >= 359.0) then 0.0 else (angle + 1.0)
  let angleRad = Math.PI / 180.0 * angle
  let (s, c) = float32(Math.Sin(angleRad)), float32(Math.Cos(angleRad))      
  let (w, h) = (bmp.Width, bmp.Height)
  (invokeRotation s c w h (w / 2) (h / 2), angle)

We start by loading a bitmap from the application startup folder and then convert it to a matrix using one of the utility functions from the Conversions module. This conversion uses a specified function to convert individual pixels to the target type of matrix elements, so we use Float4.ofColor to convert colors to values of type float4.

The rotationStep function first calculates a new angle (incremented by 1). Then it converts the angle to radians and pre-computes values that are passed to the actual rotation function (sine and cosine of the angle, half of width and height of the bitmap). Once we have all the pre-computed values, we invoke the specified rotation function and return the result together with the new angle value as the new state.

The last step that we need to do now is to call rotationStep with the first parameter specifying the rotation function running on CPU. If we specify only the first parameter, we'll get a function that calculates a new state (bitmap and angle) from the previous step (bitmap and angle). We'll use a form similar to the one we used when showing the Game of Life to run the function iteratively and display the result after each step:

let run = rotationStep (fun sn cs wid hgt widhalf hgthalf ->
  rotateImage sn cs wid hgt widhalf hgthalf data)
let toBitmap = fst >> Conversions.bitmapOfMatrix Float4.toColor  
DrawingForm.Run((data, 0.0), run, toBitmap)

The first parameter of the form is an initial state. The second parameter is a function that calculates a new state from the previous one. Finally, the last parameter is a function that converts the current state (Matrix<float4> * float) to a bitmap that can be drawn on the form. We create this function using function composition. We first take the first element of the tuple (a matrix) and then convert it to bitmap using the bitmapOfMatrix function.

Accelerating data-parallel F#

You're probably reading this article to learn how to run computations more efficiently on GPU or multi-core CPU, but we're almost at the end of the article and I didn't write a single word on this topic! Don't be disappointed. Actually, all we did in the article so far was a preparation that makes running the code using Accelerator surprisingly easy!

We wrote the rotateImage just using functions from the DataParallel module and we marked the code using ReflectedDefinition attribute. Now we can use a translator I implemented (which is available as a source code download below). The translator evaluates quoted data-parallel code using Accelerator, so we only need to invoke it:

open FSharp.Accelerator
open FSharp.Accelerator.EvalTransformation

let target = new Microsoft.ParallelArrays.DX9Target()
let processed = Accelerate.accelerate <@@@@ rotateImage @@@@>
let runAcc =    
  rotationStep (fun sn cs wid hgt widhalf hgthalf ->
    eval<Matrix<float4>> target processed
      [ makeValue sn; makeValue cs; makeValue wid; makeValue hgt;  
        makeValue widhalf; makeValue hgthalf; makeValue data ])

let toBitmap = fst >> Conversions.bitmapOfMatrix Float4.toColor
DrawingForm.Run((data, 0.0), runAcc, toBitmap)

The code is slightly more complicated than the version running on CPU, but it is still very simple. We first initialize an Accelerator target (in this example, we'll run the code on GPU using the DX9Target). Then we run the accelerate function, which is a part of the translator. It returns a new function which runs the function enclosed in quotations using Accelerator. As I already mentioned, thanks to the ReflectedDefinition attribute, the translation function can also look inside the body of rotateImage.

Once we have the accelerated function, we can use it inside the lambda function specified as a parameter of rotationStep. We cannot invoke the processed function directly, because all the parameters and the result are encoded in some way. Instead, we use the generic eval function, which takes the desired type of result as a type parameter (in our case, we specify that it should return a value of type Matrix<float4>). The first parameter is the Accelerator target, the second parameter is the accelerated function to run and finally, the last parameter is a list of arguments to the accelerated function. We also need to turn all the parameters into a special value that the eval function expects, which is done using the makeValue function.

Summary

In this article, we've seen how to work with the F# Matrix type using data-parallel functions that duplicate the functionality available in Accelerator, but work with standard F# data types. The implementation of all the functions from the DataParallel module is a part of the project that translates data-parallel F# code to Accelerator using F# quotations and you can get it from the list of downloads below. We've also seen a more complicated data-parallel calculation when we looked at implementation of an image rotation.

When we write a function using only primitives understood by the translator and when we mark the function using ReflectedDefinition, the translator can evaluate the function more efficiently using Accelerator (either using GPU or multi-core CPU). However, we can still run the code as a standard F# code, which is very useful when testing and debugging.

In this article, we looked only at the simplest implementation of rotation. When we used gather to find the location of a rotated pixel, we always used just the nearest neighbor. We can get better results by collecting several nearby pixels and interpolating the color depending on the fraction (because the desired location may be for example [121.3f, 35.8f]. This better version is implemented in the downloadable source code using the interpolatef4 function, which performs a linear interpolation between float4 values.

Finally, in this article we used the EvalTransformation module to translate the code from quotations to Accelerator. We didn't discuss the translation architecture in details (we'll do that in the next article!). However, this module processes quotations and evaluates the code using Accelerator at the same time. This makes the source code easier to understand (the downloadable source code includes only this module, so it should be fairly readable). However, a more efficient approach is to process quotations and build a function that can be executed later. When we'd want to rotate the image, we'd just invoke the function and we wouldn't have to analyze the code again. We'll look at this more efficient approach in the next article.

Downloads and References

Download the source code (ZIP, 1.09MB)

[1] Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses - Microsoft Research
[2] Accelerator Project Homepage - Microsoft Research
[3] Microsoft Research Accelerator v2 - Microsoft Connect
[4] GPGPU and x64 Multicore Programming with Accelerator from F# - Satnam Singh's blog at MSDN

Published: Monday, 4 January 2010, 12:50 PM
Author: Tomas Petricek
Typos: Send me a pull request!
Tags: academic, functional, meta-programming, accelerator, f#, math and numerics