New features and improvements in Deedle v1.0

As Howard Mansell already announced on the BlueMountain Tech blog, we have officially released the "1.0" version of Deedle. In case you have not heard of Deedle yet, it is a .NET library for interactive data analysis and exploration. Deedle works great with both C# and F#. It provides two main data structures: series, for working with data and time series, and frame, for working with collections of series (think CSV files, data tables, etc.).

The great thing about Deedle is that it is becoming a foundational library that makes it possible to integrate a wide range of diverse data-science components. For example, the R type provider works well with Deedle and so does F# Charting. We've also been working on integrating all of these into a single package called FsLab, but more about that next time!

In this blog post, I'll have a quick look at a couple of new features in Deedle (and the corresponding R type provider release). Howard's announcement has a more detailed list, but I just want to give a couple of examples and briefly comment on the performance improvements we made.

What's new in Deedle?

Perhaps the most visible difference in the new version is that many of the functions are renamed. We thought that before v1.0, we had a unique chance to get the naming right, so we did a lot of renamings to make sure that everything is consistent. For example, some functions used series and some column, some used sort and others order and so on. This should now be cleaned up. Similarly, we fixed a number of mismatches between Series and Frame modules.

Additions to Deedle API

Aside from renaming, we also added a couple of useful functions. For example, the homepage sample compares the survival ratio for different passenger classes. This can now be done even more easily using PivotTable:

#load "Deedle.1.0.0/Deedle.fsx"
open Deedle

let titanic = Frame.ReadCsv("../data/titanic.csv")

// Pivot operation using "Sex" as row 
// and "Survived" as a new column
titanic 
|> Frame.pivotTable 
    (fun _ row -> row.GetAs<string>("Sex"))
    (fun _ row -> row.GetAs<bool>("Survived"))
    Frame.countRows

// The same operation using method notation
titanic.PivotTable<string, bool, _>
  ("Sex", "Survived", Frame.countRows)

The operation groups the rows according to the two keys and then performs aggregation using the specified function (here Frame.countRows). This is a common operation and so we wanted to make it as simple as possible. We also continue to expose operations both as F# functions in modules and as C#-friendly methods.
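
To make the "group by two keys, then aggregate" idea concrete, here is a minimal sketch in plain F# that does not use Deedle at all. The Passenger record, the sample rows and the pivot helper are all made up for illustration; this is not how Deedle implements the operation, only what the result means.

```fsharp
// Hypothetical in-memory rows standing in for the Titanic data set
type Passenger = { Sex : string; Survived : bool }

let passengers =
  [ { Sex = "male";   Survived = false }
    { Sex = "male";   Survived = true }
    { Sex = "female"; Survived = true }
    { Sex = "female"; Survived = true }
    { Sex = "male";   Survived = false } ]

/// Groups items by a (row key, column key) pair and
/// applies the aggregation function to each group.
let pivot rowKey colKey aggregate items =
  items
  |> List.groupBy (fun item -> rowKey item, colKey item)
  |> List.map (fun (key, group) -> key, aggregate group)
  |> Map.ofList

// Count passengers for each (Sex, Survived) combination
let counts =
  pivot
    (fun (p : Passenger) -> p.Sex)
    (fun (p : Passenger) -> p.Survived)
    List.length
    passengers

printfn "%A" (Map.toList counts)
```

In the sketch, each key of the resulting map corresponds to one cell of the pivot table; Deedle arranges the same cells into a frame with the first key as the row and the second as the column.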

Another area where we made a lot of improvements is statistics:

open System

let msft = Frame.ReadCsv<DateTime>("../data/msft.csv", "Date")
msft?Open |> Stats.movingVariance 100
msft?Open |> Stats.expandingMean
msft?Open |> Stats.kurt

The first improvement is that you can now specify a key column when loading data from a CSV file (again, this is very common). The same feature is available when loading data from a sequence of .NET objects using Frame.ofRows.

The next new thing is the Stats module. This is the new place for all functions related to statistics and numerical computations. We found that adding more functions to the Series and Frame modules was a bit confusing, so we moved all statistical functions into one place. This is even more important now that we added more functions (kurtosis, skewness, variance) and more ways to calculate them (moving and expanding windows). For more information see the frame and series statistics page.
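
If the difference between moving and expanding windows is not obvious, the following plain F# sketch may help. It works on ordinary lists rather than Deedle series, and the function names only mirror the Stats functions for readability. Two things are assumptions of the sketch, not statements about Deedle: the variance divides by n - 1 (sample variance), and the moving version only produces results for full windows.

```fsharp
// Hypothetical price data
let prices = [ 10.0; 12.0; 11.0; 15.0; 14.0 ]

/// Expanding mean: the mean of all values seen so far (one result per input).
let expandingMean (values : float list) =
  values
  |> List.scan (fun (sum, n) v -> sum + v, n + 1) (0.0, 0)
  |> List.tail   // drop the initial (0.0, 0) state
  |> List.map (fun (sum, n) -> sum / float n)

/// Sample variance of one window (divides by n - 1; an assumption of this sketch).
let variance (window : float list) =
  let mean = List.average window
  let sumSq = window |> List.sumBy (fun v -> (v - mean) ** 2.0)
  sumSq / float (List.length window - 1)

/// Moving variance: the variance of each full window of the given size.
let movingVariance size (values : float list) =
  values |> List.windowed size |> List.map variance

let expanding = expandingMean prices   // [10.0; 11.0; 11.0; 12.0; 12.4]
let moving = movingVariance 3 prices   // three results, one per full window

printfn "%A" expanding
printfn "%A" moving
```

An expanding window always starts at the first value and grows, so the result has as many elements as the input. A moving window has a fixed size and slides along the data.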

Improved documentation

Finally, one of the strong points of Deedle is that it has excellent documentation. This is now even more the case, because we polished the documentation automatically generated from Markdown comments in the source code. In particular, for the three core modules:

  • Series module provides functions for working with individual data series and time-series values. This includes operations such as sampling, transformations, data access and more.
  • Frame module provides functions that are similar to those in the Series module, but operate on entire data frames. You can transform, align and join frames, perform various re-indexing operations etc.
  • Stats module implements standard statistical functions (mean, variance, kurtosis, skewness, etc.) over series, moving windows, expanding windows and a lot more. The module contains functions for both series and frames.

What's new in the R provider?

Together with a new release of Deedle, we also updated the R type provider. There are a couple of improvements that make it work a lot better:

  • The installation from NuGet no longer relies on a PowerShell installation script, so it can work on Mono and when using the "Restore Packages" feature.
  • The type provider communicates with R via a separate process, so it is more stable and it will also let us call the 64-bit version of R.

These are technical, but very important improvements. However, we also added one nice new feature that makes it even easier to mix R and F#!

RData type provider

In R, you can save workspaces (environments) into *.rdata files. This is useful if you want to archive results of some interactive analysis done in the R environment. But, wouldn't it be nice if you could do some data analysis in R and then save the data to a file and load it easily from F# in a type-safe way?

This is exactly what you get with the RData type provider! Let's say that I have a cars.rdata file containing the mtcars data set (saved under the name cars) together with a list mpg and a value mpgMean. I can write:

#load "RProvider.1.0.9/RProvider.fsx"
open RProvider

let file = new RData<"../data/cars.rdata">()

// Calculate mean in R and in F#
let mean1 = file.mpg |> Array.average
let mean2 = file.mpgMean.[0]

// Average mpg based on cylinder count
file.cars
|> Frame.groupRowsByInt "cyl"
|> Frame.getCol "mpg"
|> Stats.levelMean fst
|> Series.observations

If you look at the types, you'll see that file.mpg is of type float[] and file.cars is of type Frame<string, string>. The R type provider uses the installed plugins (like the Deedle plugin) to find the most appropriate F# type for exposing the data, and so the R data frame cars is automatically exposed as a Deedle frame.

This lets us quickly group the values by "cyl" (number of cylinders) and then calculate average miles per gallon "mpg" for each of the groups. Using F# Charting, the result looks like this:

Deedle performance improvements

In this release of Deedle, we spent some time on improving the performance. The first version was designed with performance in mind and the internals make it possible to implement operations efficiently (e.g. in F#, it is quite easy to write code so that the data is stored in contiguous memory blocks). However, there were a number of places where some Deedle function just used the "simplest stupid way to get things done".

This was nice, because it let us quickly build a sophisticated and easy to use API, but there were cases where things were just too slow. So, improving performance is an ongoing effort and if you find a use case where Deedle is slow, please submit an issue!

Measuring performance

To make sure we can monitor the performance, I created a fairly simple tool that lets us measure performance automatically. This is currently available in my branch. The tool is started via a FAKE script and it measures the performance of all tests in a specified file. The tests also serve as unit tests. For example:

[<Test;PerfTest(Iterations=10)>]
let ``Merge 3 unordered 300k long series (repeating Merge)`` () =
  r1.Merge(r2).Merge(r3).KeyCount |> shouldEqual 900000

The PerfTest attribute specifies that the function is a performance test and it also lets us specify the number of iterations (so that we run quick tests repeatedly, but slow tests only a few times).
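
To give a rough idea of how an attribute like this can drive a measurement, here is a small self-contained sketch. It is not the actual tool from my branch: the PerfTestAttribute definition, the SampleTests container and the runPerfTests function are all hypothetical stand-ins, and the real tool is started via FAKE rather than called directly.

```fsharp
open System
open System.Diagnostics
open System.Reflection

// Hypothetical stand-in for the attribute used by the performance tool
[<AttributeUsage(AttributeTargets.Method)>]
type PerfTestAttribute() =
  inherit Attribute()
  /// How many times the runner should repeat the test
  member val Iterations = 1 with get, set

// Hypothetical test container (static members are easy to find via reflection)
type SampleTests =
  [<PerfTest(Iterations = 5)>]
  static member SumOneMillion () : unit =
    Seq.sum (seq { 1L .. 1000000L }) |> ignore

  // No attribute, so the runner skips this one
  static member NotAPerfTest () : unit = ()

/// Runs every method marked with PerfTest on the given type and returns
/// (test name, iterations, total elapsed milliseconds) for each of them.
let runPerfTests (testType : Type) =
  [ for m in testType.GetMethods(BindingFlags.Public ||| BindingFlags.Static) do
      match m.GetCustomAttributes(typeof<PerfTestAttribute>, false) with
      | [| :? PerfTestAttribute as attr |] ->
          let sw = Stopwatch.StartNew()
          for _ in 1 .. attr.Iterations do
            m.Invoke(null, [||]) |> ignore
          sw.Stop()
          yield m.Name, attr.Iterations, sw.Elapsed.TotalMilliseconds
      | _ -> () ]

let results = runPerfTests typeof<SampleTests>
for (name, iterations, ms) in results do
  printfn "%s: %d iterations in %.1f ms" name iterations ms
```

The sketch reports the total time over all iterations, which is one simple choice; a real runner would probably also want a warm-up run so that JIT compilation does not distort the first iteration.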

Absolute performance

I did two simple analyses of the performance. The first chart compares the new version of Deedle with the previous version available on NuGet:

• v0.9.12 (November 2013)
• v1.0.0 (May 2014)

The numbers represent the total number of milliseconds needed to run the test. Note that the X axis is limited to 10 seconds, but some of the tests actually take longer using the old version. Also, some tests only have a value for the new version. This is because they use functions that are new in v1.0.

A couple of points worth mentioning:

  • Some of the notable improvements are when merging series. This also applies to joining of frames (e.g. when applying numerical operations). We also added an overload of Merge on frames that can merge multiple series at once, which is significantly faster (and lets you merge e.g. 1000 frames, which was previously too slow).

  • There are a number of improvements in Resample operations. Again, this is just an example of a more general speedup (that also affects windowing and chunking functions).
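
For series, the difference between repeated merging and merging several values in one call looks roughly as follows. This is a sketch with tiny made-up series, assuming that the Deedle package can be referenced with the `#r "nuget: ..."` directive (with the NuGet package from this release you would use the `#load` line shown earlier instead). It only shows the two calling styles and does not measure anything.

```fsharp
#r "nuget: Deedle"
open Deedle

// Three small series with disjoint keys (made-up data)
let s1 = series [ 1 => 1.0; 2 => 2.0 ]
let s2 = series [ 3 => 3.0; 4 => 4.0 ]
let s3 = series [ 5 => 5.0; 6 => 6.0 ]

// Repeated merging, as in the performance test shown earlier
let pairwise = s1.Merge(s2).Merge(s3)

// Merging all remaining series in a single call
let atOnce = s1.Merge(s2, s3)

printfn "%d keys vs. %d keys" pairwise.KeyCount atOnce.KeyCount
```

Both calls produce a series with the same keys. The single call avoids building the intermediate series that the chained version creates at each step.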

Relative performance

In the previous chart, it is a bit difficult to see where the greatest performance improvements are. In the following chart, the tests are scaled so that the time taken by the original version (0.9.12) is 100% and the time taken by the new version is shown as a percentage of that (so cutting 10 sec down to 5 sec shows as 50%).
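
The scaling itself is just a ratio. As a small sketch (the function names and the sample timings are made up for illustration):

```fsharp
/// Time of the new version as a percentage of the time of the old version.
let relativeTime (oldMs : float) (newMs : float) = newMs / oldMs * 100.0

/// The same comparison expressed as a speedup factor.
let speedup (oldMs : float) (newMs : float) = oldMs / newMs

let percent = relativeTime 10000.0 5000.0   // 50.0
let factor = speedup 10000.0 5000.0         // 2.0

printfn "%.0f%% of the original time (%.1fx faster)" percent factor
```

So a lower percentage is better, and a test that becomes about 6x faster shows up at roughly 17%.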

Again, you can see a number of interesting things:

  • The biggest speedup is on "Accessing float series via object series". This is the case when you access a column on a frame using df.Columns (which returns a series of ObjectSeries<'K> values). Because we do not know the type of individual columns, we return them as series containing obj values. In the new version, this does not actually box the values and so converting the series back to Series<'K, float> is essentially a no-op.

  • We also did some work on improving grouping (and related) operations so, for example, the homepage sample is now about twice as fast. There is still (a lot of) room for improvement, but as you can see, we're working hard on this!

  • The joining and merging operations are about 6x faster, but for Merge this is even more significant when you're merging multiple frames.
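
The first point above (object series that do not box their values) can be illustrated with a simplified sketch. The types below are hypothetical and much simpler than Deedle's internals. The idea is only that an untyped view can wrap the original typed data, so that getting the typed data back is a type test rather than a conversion of every element.

```fsharp
open System

/// Untyped view of a column (hypothetical, not Deedle's actual interface)
type IUntypedColumn =
  abstract ElementType : Type
  abstract GetObject : int -> obj   // boxes a single value on demand

/// Typed column that also implements the untyped view
type Column<'T>(data : 'T[]) =
  member x.Data = data
  interface IUntypedColumn with
    member x.ElementType = typeof<'T>
    member x.GetObject i = box data.[i]

/// Recovers the typed data. If the column already stores values of
/// type 'T, this is only a type test and nothing is copied or unboxed.
let tryAs<'T> (col : IUntypedColumn) : 'T[] option =
  match col with
  | :? Column<'T> as typed -> Some typed.Data
  | _ -> None

let original = [| 1.0; 2.5; 4.0 |]
let untyped = Column<float>(original) :> IUntypedColumn

let floats = tryAs<float> untyped   // Some, pointing at the same array
let ints = tryAs<int> untyped       // None, the column does not store ints

printfn "%A %A" floats ints
```

The alternative would be to copy the column into an obj[] when exposing it and to unbox every element when converting back, which costs time and memory proportional to the length of the column.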

The tests that I included here are by no means comprehensive. They simply represent a couple of test cases that I was working on. However, with the performance measurements in place, we should be able to use this more and more often! So, if you have an interesting use case, submit a pull request adding a performance test!

Summary

The "1.0" release of Deedle is an important milestone. Although Deedle has been around since November (and it has been used internally by BlueMountain), the "1.0" release means that the library is becoming more stable and ready for others to adopt.

Of course, there is always room for improvement. There are operations that could be faster (please report them!), there are functions that should be added (please suggest them!) and there are likely a few remaining bugs. I marked some issues as up-for-grabs in case you wanted to contribute directly.

Another important thing about Deedle is that it is a foundational component around which we can build an awesome .NET data science stack. If you're interested, register at www.fslab.org and follow this blog for more information.

There are many people who contributed to Deedle (and the R provider), but the projects wouldn't exist without Howard Mansell and Adam Klein at BlueMountain. A lot of the R provider work has been done by David Charboneau. Thanks!
