New features and improvements in Deedle v1.0
As Howard Mansell already announced on the BlueMountain Tech blog, we have officially released the "1.0" version of Deedle. In case you have not heard of Deedle yet, it is a .NET library for interactive data analysis and exploration. Deedle works great with both C# and F#. It provides two main data structures: series for working with data and time series, and frame for working with collections of series (think CSV files, data tables, etc.).
The great thing about Deedle is that it is becoming a foundational library that makes it possible to integrate a wide range of diverse data-science components. For example, the R type provider works well with Deedle and so does F# Charting. We've also been working on integrating all of these into a single package called FsLab, but more about that next time!
In this blog post, I'll have a quick look at a couple of new features in Deedle (and the corresponding R type provider release). Howard's announcement has a more detailed list, but I just want to give a couple of examples and briefly comment on the performance improvements we made.
What's new in Deedle?
Perhaps the most visible difference in the new version is that many of the functions
have been renamed. We thought that before v1.0, we had a unique chance to get the naming
right, so we renamed a lot of functions to make sure that everything is consistent. For
example, some functions used series and some column, some used sort and others
order, and so on. This should now be cleaned up. Similarly, we fixed a number of
mismatches between the Series and Frame modules.
Additions to Deedle API
Aside from renaming, we also added a couple of useful functions. For example, the
homepage sample compares survival rates for different passenger classes. This can now
be done even more easily using PivotTable.
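A minimal sketch of such a pivot table, assuming a small hand-built frame in place of the Titanic data set. The Passenger record and its values are made up for illustration; Frame.pivotTable and Frame.countRows are the functions discussed in this post:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical stand-in for the Titanic data set (values are illustrative only)
type Passenger = { Pclass : int; Survived : bool }

let passengers =
  [ { Pclass = 1; Survived = true }
    { Pclass = 1; Survived = true }
    { Pclass = 1; Survived = false }
    { Pclass = 3; Survived = true }
    { Pclass = 3; Survived = false }
    { Pclass = 3; Survived = false } ]

let titanic = Frame.ofRecords passengers

// Rows of the result are passenger classes, columns are the survival flag,
// and each cell holds the number of rows that fall into that group
let byClass =
  titanic
  |> Frame.pivotTable
       (fun _ row -> row.GetAs<int>("Pclass"))
       (fun _ row -> row.GetAs<bool>("Survived"))
       Frame.countRows

// Number of first-class passengers who survived in the sample data
let firstClassSurvivors = byClass.GetColumn<int>(true).[1]
printfn "%d" firstClassSurvivors
```

The two lambdas pick the row key and the column key for each input row, so swapping them transposes the resulting table.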
The operation groups the rows according to the two keys and then performs aggregation
using the specified function (here Frame.countRows). This is a common operation and
so we wanted to make it as simple as possible. We also continue to expose operations
both as F# functions in modules and as C#-friendly methods.
Another example where we made a lot of improvements is statistics.
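A short sketch of the statistics functions, assuming a hand-built series of made-up prices rather than a real CSV file. The commented-out ReadCsv call, the file name and the column name are assumptions for illustration, not necessarily the code used for this post:

```fsharp
#r "nuget: Deedle"
open System
open Deedle

// Loading a frame with a key column might look like this (not executed here,
// because it needs a file on disk; the path and column name are hypothetical):
//   let msft = Frame.ReadCsv<DateTime>("msft.csv", indexCol="Date")

// Hypothetical daily closing prices keyed by date
let prices =
  series [ for i in 1 .. 5 -> DateTime(2014, 1, i) => float i ]

// Statistics over the whole series
let mean     = Stats.mean prices
let variance = Stats.variance prices
let kurtosis = Stats.kurt prices

// Statistics over expanding and moving windows
let expanding = Stats.expandingMean prices   // mean of all values up to each key
let moving    = Stats.movingMean 3 prices    // mean over a sliding window of 3 values

printfn "%f" mean
```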
The first improvement is that you can now specify a key column when loading data from a CSV
file (again, this is very common). The same feature is available when loading data from
a sequence of .NET objects using Frame.ofRows.
The next new thing is the Stats module. This is the new place for all functions related
to statistics and numerical computations. We found that adding more functions to the Series
and Frame modules was a bit confusing, so we moved all statistical functions into one place.
This is even more important now that we added more functions (kurtosis, skewness, variance)
and more ways to calculate them (moving and expanding windows). For more information,
see the frame and series statistics page.
Improved documentation
Finally, one of the strong points of Deedle is that it has excellent documentation. This is now even more the case, because we polished the documentation automatically generated from Markdown comments in the source code. In particular, for the three core modules:
- Series module provides functions for working with individual data series and time-series values. This includes operations such as sampling, transformations, data access and more.
- Frame module provides functions that are similar to those in the Series module, but operate on entire data frames. You can transform, align and join frames, perform various re-indexing operations etc.
- Stats module implements standard statistical functions (mean, variance, kurtosis, skewness, etc.) over series, moving windows, expanding windows and a lot more. The module contains functions for both series and frames.
What's new in the R provider?
Together with a new release of Deedle, we also updated the R type provider. There are a couple of improvements that make it work a lot better:
- The installation from NuGet no longer relies on a PowerShell installation script, so it can work on Mono and when using the "Restore Packages" feature.
- The type provider communicates with R via a separate process, so it is more stable and it will also let us call the 64-bit version of R.
These are technical, but very important improvements. However, we also added one nice new feature that makes it even easier to mix R and F#!
RData type provider
In R, you can save workspaces (environments) into *.rdata files. This is useful
if you want to archive results of some interactive analysis done in the R environment.
But wouldn't it be nice if you could do some data analysis in R and then save the
data to a file and load it easily from F# in a type-safe way?
This is exactly what you get with the RData type provider! Let's say that I have
a cars.rdata file containing the mtcars data set (saved under the name cars)
together with a list mpg and a value mpgMean. I can load this file with the provider
into a value named file and access the saved data as its members.
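The RData provider itself needs R and the data file, so the following is only a sketch of the Deedle half of such an analysis. It assumes a tiny hand-built frame (the Car record and its values are made up) in place of the frame that the provider exposes as file.cars, and groups it by "cyl" to average "mpg":

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical stand-in for the frame that the provider exposes as file.cars
type Car = { cyl : int; mpg : float }

let cars =
  Frame.ofRecords
    [ { cyl = 4; mpg = 20.0 }
      { cyl = 4; mpg = 30.0 }
      { cyl = 6; mpg = 18.0 }
      { cyl = 6; mpg = 22.0 } ]

// Group rows by the number of cylinders, pick the "mpg" column and average it
// within each group (the group is the first component of the two-part row key)
let mpgByCyl =
  cars
  |> Frame.groupRowsByInt "cyl"
  |> Frame.getCol "mpg"
  |> Stats.levelMean fst

// Key/value pairs that could be handed to a charting library
printfn "%A" (mpgByCyl |> Series.observations |> List.ofSeq)
```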
If you look at the types, you'll see that file.mpg is of type float[] and
file.cars is of type Frame<string, string>. The R type provider uses the installed
plugins (like the Deedle plugin) to find the most appropriate F# type for exposing
the data and so the R data frame cars is automatically exposed as a Deedle frame.
This lets us quickly group the values by "cyl" (number of cylinders) and then calculate average miles per gallon "mpg" for each of the groups. Using F# Charting, the result looks like this:
Deedle performance improvements
In this release of Deedle, we spent some time on improving the performance. The first version was designed with performance in mind and the internals make it possible to implement operations efficiently (e.g. in F#, it is quite easy to write code so that the data is stored in contiguous memory blocks). However, there were a number of places where some Deedle functions just used the "simplest stupid way to get things done".
This was nice, because it let us quickly build a sophisticated and easy to use API, but there were cases where things were just too slow. So, improving performance is an ongoing effort and if you find a use case where Deedle is slow, please submit an issue!
Measuring performance
To make sure we can monitor the performance, I created a fairly simple tool that lets us measure performance automatically. This is currently available in my branch. The tool is started via a FAKE script and it measures the performance of all tests in a specified file. The tests also serve as unit tests. For example:
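A sketch of what such a test might look like. The PerfTestAttribute defined here is a hypothetical stand-in for the one used by the measuring tool, and the series are far smaller than in a real performance test:

```fsharp
#r "nuget: Deedle"
open System
open Deedle

// Hypothetical stand-in for the attribute that marks performance tests
type PerfTestAttribute(iterations : int) =
  inherit Attribute()
  member x.Iterations = iterations

// Three series with disjoint keys: 0,3,6,...  1,4,7,...  2,5,8,...
let r1 = series [ for i in 0 .. 99 -> (3 * i)     => float i ]
let r2 = series [ for i in 0 .. 99 -> (3 * i + 1) => float i ]
let r3 = series [ for i in 0 .. 99 -> (3 * i + 2) => float i ]

// The attribute argument is the number of iterations requested from the runner
[<PerfTest(10)>]
let ``Merge 3 series (repeating Merge)`` () =
  r1.Merge(r2).Merge(r3)

// Calling the test directly, as a unit-test runner would
let merged = ``Merge 3 series (repeating Merge)`` ()
printfn "%d" merged.KeyCount
```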
The PerfTest attribute specifies that the function is a performance test and it
also lets us specify the number of iterations (so that we run quick tests repeatedly, but
slow tests only a few times).
Absolute performance
I did two simple analyses of the performance. The first chart compares the new version of Deedle with the previous version available on NuGet:
• v0.9.12 (November 2013)
• v1.0.0 (May 2014)
The numbers represent the total number of milliseconds needed to run the test. Note that the X axis is limited to 10 seconds, but some of the tests actually take longer using the old version. Also, some tests only have a value for the new version; this is because they use functions that are new in v1.0.
A couple of points worth mentioning:
- Some of the notable improvements are when merging series. This also applies to joining of frames (e.g. when applying numerical operations). We also added an overload of Merge on frames that can merge multiple series at once, which is significantly faster (and lets you merge e.g. 1000 frames, which was previously too slow).
- There are a number of improvements in Resample operations. Again, this is just an example of a more general speedup (that also affects windowing and chunking functions).
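To illustrate the difference between repeated merging and merging everything at once, here is a small sketch on series. It assumes the Merge overload on Series that accepts a sequence of series; the three series and their values are made up:

```fsharp
#r "nuget: Deedle"
open Deedle

// Three small series with disjoint keys (illustrative values only)
let s1 = series [ 1 => 1.0; 4 => 4.0 ]
let s2 = series [ 2 => 2.0; 5 => 5.0 ]
let s3 = series [ 3 => 3.0; 6 => 6.0 ]

// Repeated merging builds an intermediate series after every step
let oneByOne = s1.Merge(s2).Merge(s3)

// Merging several series in a single call
let allAtOnce = s1.Merge(seq [ s2; s3 ])

// Both approaches contain the same observations
let sumOneByOne  = oneByOne  |> Series.values |> Seq.sum
let sumAllAtOnce = allAtOnce |> Series.values |> Seq.sum
printfn "%d keys, sums %f and %f" allAtOnce.KeyCount sumOneByOne sumAllAtOnce
```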
Relative performance
In the previous chart, it is a bit difficult to see which tests improved the most. In the following chart, the tests are scaled so that the performance using the original version (0.9.12) is used as 100% and the performance using the new version is shown as a percentage (so cutting 10sec down to 5sec shows as 50%).
Again, you can see a number of interesting things:
- The biggest speedup is on "Accessing float series via object series". This is the case when you access a column on a frame using df.Columns (which returns a series of ObjectSeries<'K> values). Because we do not know the type of individual columns, we return them as series containing obj values. In the new version, this does not actually box the values and so converting the series back to Series<'K, float> is essentially a no-op.
- We also did some work on improving grouping (and related) operations, so, for example, the homepage sample is now about twice as fast. There is still (a lot of) room for improvement, but as you can see, we're working hard on this!
- The joining and merging operations are about 6x faster, but for Merge this is even more significant when you're merging multiple frames.
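The "float series via object series" case from the first point can be sketched as follows, assuming a tiny hand-built frame (the Row record and its values are made up). The sketch only shows the access pattern and says nothing about how the conversion is implemented internally:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical frame with one numeric and one textual column
type Row = { x : float; label : string }

let df =
  Frame.ofRecords
    [ { x = 1.5; label = "a" }
      { x = 2.5; label = "b" } ]

// df.Columns exposes every column as an object series, because the
// columns of a frame can have different value types
let xAsObjects : ObjectSeries<int> = df.Columns.["x"]

// Converting the object series back to a typed series of floats
let xAsFloats : Series<int, float> = xAsObjects.As<float>()

// The typed accessor returns the same values directly
let xDirect = df.GetColumn<float>("x")

let total = xAsFloats |> Series.values |> Seq.sum
printfn "%f" total
```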
The tests that I included here are by no means comprehensive. They simply represent a couple of test cases that I was working on. However, with the performance measurements in place, we should be able to use this more and more often! So, if you have an interesting use case, submit a pull request adding a performance test!
Summary
The "1.0" release of Deedle is an important milestone. Although Deedle has been around since November (and it has been used internally by BlueMountain), the "1.0" release means that the library is becoming more stable and ready for others to adopt.
Of course, there is always room for improvement. There are operations that could be faster (please report them!), there are functions that should be added (please suggest them!) and there are likely a few remaining bugs. I marked some issues as up-for-grabs in case you wanted to contribute directly.
Another important thing about Deedle is that it is a foundational component around which we can build an awesome .NET data science stack. If you're interested, register at www.fslab.org and follow this blog for more information.
There are many people who contributed to Deedle (and R provider), but the projects wouldn't exist without Howard Mansell and Adam Klein at BlueMountain. A lot of the R provider work has been done by David Charboneau. Thanks!
Published: Tuesday, 27 May 2014, 4:41 PM
Author: Tomas Petricek
Typos: Send me a pull request!
Tags: f#, deedle, data science