Using custom grouping operators in LINQ
You can use LINQ to write queries that perform grouping of data using group by
or ordering of data using orderby
clause. LINQ provides the default
(and the most common) implementation of both of the operations, but sometimes you
may need a slightly different behavior when grouping or ordering data (this article
is motivated by a question on StackOverflow
which needs to do exactly that for grouping).
Let's look at a simple example, which shows when we may need a different behavior when
grouping data. For example, we may have the following list of stock trades
containing a name of a stock and the price of the trade (stored for example as a list
of TradeInfo
classes with properties Name
and Price
):
{ { Name = "MSFT", Price = 80.00 },
{ Name = "MSFT", Price = 70.00 },
{ Name = "GOOG", Price = 100.00 },
{ Name = "GOOG", Price = 200.00 },
{ Name = "GOOG", Price = 300.00 },
{ Name = "MSFT", Price = 30.00 },
{ Name = "MSFT", Price = 20.00 } }
Now, we may want to group adjacent trades into a single summary record which will contain the name of the stock (which is same for all trades in each group), the number of trades in the group and an average price in the group. The desired results are:
{ { Name = "MSFT", Count = 2, AvgPrice = 75.00 },
{ Name = "GOOG", Count = 3, AvgPrice = 200.00 },
{ Name = "MSFT", Count = 2, AvgPrice = 25.00 } }
The operation that we want to do is very similar to group by
in LINQ, but
it doesn't do quite the same thing! If we used group by
, we would get only
two groups as the result. However, as I wrote earlier, we want to group only
adjacent trades. You could write your own extension method to do this,
but then you need to leave the elegant LINQ query syntax. In this article, I'll show
you how to get the desired results using a simple LINQ query with a group by
clause...
Customizing LINQ queries
Let's start by looking how we can customize the meaning of LINQ queries. In fact, you may have already
seen this - for example the AsParallel
method from PLINQ does exactly that.
Anyway, when you write a LINQ query, it is translated by the C# compiler to a sequence of
method calls. The following query groups trades using the standard GroupBy
operation provided by LINQ (so the result will be only two groups):
var agg =
from t in trades
group t by t.Name into g
select new { Name = g.Key, Count = g.Count(),
AvgPrice = g.Average(t => t.Price) };
Following the rules described in C# specification or using Reflector, we can see what the
compiler does with the above query. The use of group by
clause is translated in
a call to GroupBy
(an argument is a lambda expression that selects the key we're using for
grouping) and the use of select
is translated into a call to Select
method (an argument is lambda expression returning the anonymous type):
var agg =
trades.GroupBy(t => t.Name).Select(g =>
new { Name = g.Key, Count = g.Count(),
AvgPrice = g.Average(t => t.Price) });
This translation is done without checking whether the methods actually exist and what their type
is. In this example, the type of trades
is IEnumerable<TradeInfo>
.
When compiling the call to GroupBy
, the compiler will first look for instance methods
of the interface. It doesn't provide any GroupBy
method, so it will try finding
some extension method and it will use Enumerable.GroupBy
, which is an extension
method for the IEnumerable<T>
type.
Now, what can we do if we want to use a different GroupBy
method? We need to
instruct the compiler to select a different extension method. We can implement a very simple
method WithAdjacentGrouping
which takes IEnumerable<T>
and
returns some other interface (we'll call it IAdjacentGrouping<T>
). The
implementation of the inteface is just a wrapper of IEnumerable<T>
, but
it means that C# compiler will use a different type when searching for the
GroupBy
method:
var agg =
trades.WithAdjacentGrouping()
.GroupBy(...).Select(...);
We'll provide our own implementation of the GroupBy
method, which groups
only adjacent elements of the input sequence. The method will take an argument of type
IAdjacentGrouping<T>
, so when the compiler analyzes the code above,
it will use our method instead of the standard one, which is available in the core
libraries (LINQ to Objects). And of course, this will also work with the LINQ query
syntax, because that is simply translated to method calls. We'll look at some nice
queries shortly, but let's first implement the required interface and GroupBy
method.
Implementing adjacent grouping
To implement all the machinery that allows us to use custom GroupBy
method,
we need to declare the IAdjacentGrouping<T>
interface (including a
concrete class implementing it) and we'll also need a class which implements the
IGrouping<T>
interface (which represents a group of elements with
a Key
). Once that's done, we'll need two extension methods - our customized
GroupBy
and a method that instructs the compiler to use it (WithAdjacentGrouping
).
Interfaces and classes
Let's start with the IAdjacentGrouping<T>
interface. We inherit it from
IEnumerable<T>
, which means that all other LINQ query operators (other than
group by
) will use the standard implementation. This will have some unfortunate
consequences. All LINQ query operators return IEnumerable<T>
, so if we use
any other LINQ operator before our GroupBy
, the query will "not remember"
our non-standard grouping. This can be solved by providing our own implementations of other
operators (and we'll discuss this in details later). Other than inheriting from IEnumerable<T>
our new interface will not have any members, because we need it only to carry
type information through the query:
interface IAdjacentGrouping<T> : IEnumerable<T> {
}
class WrappedAdjacentGrouping<T> : IAdjacentGrouping<T> {
public IEnumerable<T> Wrapped { get; set; }
public IEnumerator<T> GetEnumerator() {
return Wrapped.GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator() {
return (IEnumerator)GetEnumerator();
}
}
The class WrappedAdjacentGrouping<T>
is a simple implementation of our
new interface. It wraps an IEnumerable<T>
value and delegates all
operations to the wrapped type, so this is pretty uninteresting boilerplate code.
We'll need one more trivial class. A typical grouping operation takes a list of elements
and returns a list of lists (that is, a list of groups where each group consists of
one or more inidividual elements). In LINQ, this is usually done by returning a value
of type IEnumerable<IGrouping<TKey, T>>
. The
IGrouping<TKey, T>
type is just like IEnumerable<T>
with one additional feature - it has a property Key
, which returns the
key used to identify the group (in our earlier example, the key would be the name
of the stock such as GOOG
or MSFT
). Since
IGrouping<TKey, T>
is an interface and .NET libraries don't provide
any simple implementation of the interface, we'll need to write our own:
class Grouping<K, T> : IGrouping<K, T> {
public K Key { get; set; }
public IEnumerable<T> Elements;
public IEnumerator<T> GetEnumerator() {
return Elements.GetEnumerator();
}
IEnumerator IEnumerable.GetEnumerator() {
return (IEnumerator)GetEnumerator();
}
}
The class stores elements of the group in a property Elements
, which will be
usually accessed via the IEnumerable<T>
interface (both generic and
non-generic GetEnumerator
methods just delegate the operation to the wrapped
collection). The class also has a property Key
with a getter (required by the
interface) and setter (so that we can use it easily). Now that we have all the boilerplate
code, let's look at more interesting things. In the next section, we'll implement our
custom grouping.
Implementing custom grouping
Our custom GroupBy
method has exactly the same type signature as the
GroupBy
method provided by LINQ (with the only exception that it takes
the IAdjacentGrouping<T>
as the first argument). It implements the behavior discussed
in the introduction. Instead of grouping all elements into groups based on
the returned keys, it groups all adjacent elements with the same key from the
input collection.
The implementation of this functionality is the only lengthy piece of code in this article. We'll need to store a collection of elements in the current group (grouped so far) with a key of the current group. Each time we move to the next element, we'll check if it has the same key as the current group or not. If the key didn't change, we'll just add the element to the current group and continue. If the key changes, we'll return the previous group and start a new one. We also need to deal specially with the first element:
public static IEnumerable<IGrouping<K, T>> GroupBy<T, K>
(this IAdjacentGrouping<T> source, Func<T, K> keySelector) where K : IComparable {
// Remembers elements of the current group
List<T> elementsSoFar = new List<T>();
IEnumerator<T> en = source.GetEnumerator();
// Read the first element (we need an initial key value)
if (en.MoveNext()) {
K lastKey = keySelector(en.Current);
do {
// Test whether current element starts a new group
K newKey = keySelector(en.Current);
if (newKey.CompareTo(lastKey) != 0)
{
// Yield the previous group and start next one
yield return new Grouping<K, T>
{ Elements = elementsSoFar, Key = lastKey };
lastKey = newKey;
elementsSoFar = new List<T>();
}
// Add element to the current group
elementsSoFar.Add(en.Current);
}
while (en.MoveNext());
// Yield the last group of sequence
yield return new Grouping<K, T>
{ Elements = elementsSoFar, Key = lastKey };
}
}
We're using the IEnumerator<T>
to iterate over the source elements,
because this allows us to call the MoveNext
once before we start looping
(to get the key of the first element, which is also the key of the first group). Once
we initialize the lastKey
variable, we start looping until the source
is exhausted. Note that our method has a generic constraint saying that the key
should be IComparable
. This allows us to compare keys (and decide whether to
start a new group or not) using the CompareTo
method.
The last thing we need to do is to implement the WithAdjacentGrouping
method,
which instructs the compiler to use our GroupBy
. As discussed earlier, the method
will change the type of the collection from IEnumerable<T>
to our
type IAdjacentGrouping<T>
, so that the compiler will prefer our
GroupBy
method (because it is an extension method directly for the
IAdjacentGrouping<T>
type):
public static IAdjacentGrouping<T> WithAdjacentGrouping<T>(this IEnumerable<T> e) {
return new WrappedAdjacentGrouping<T> { Wrapped = e };
}
The extension method is trivial. It simply returns our concrete implementation of the interface,
which wraps an IEnumerable<T>
. This was the last missing piece that
we needed to implement, before we could use our grouping operation in LINQ queries, so let's
look at a couple of examples showing how this can be used.
Grouping trades and other examples
First of all, let's look at the example, which I presented as a motivation at the beginning
of this article. We can use our new extension method WithAdjacentGrouping
to change
the meaning of a group by
clause in a query. When we use the extension method,
the query will group only adjacent elements, which is exactly what we wanted:
var groups =
from t in trades.WithAdjacentGrouping()
group t by t.Name into g
select new {
Name = g.Key, Count = g.Count(),
AvgPrice = g.Average(t => t.Price) };
The query uses the value specified in the by
clause to decide whether
to start a new group or whether an element belongs to the current group. This means
that it will start a new group each time the t.Name
value changes. When
that happens, it will use the select
clause to generate aggregate
information about the group. In this case, we return the number of elements and
an average price in the group. If you run the query with the input data from the
beginning of the article, you'll get the following result:
{ { Name = "MSFT", Count = 2, AvgPrice = 75.00 },
{ Name = "GOOG", Count = 3, AvgPrice = 200.00 },
{ Name = "MSFT", Count = 2, AvgPrice = 25.00 } }
I'm sure you can imagine other situations when this grouping technique would be useful. It also seems to be useful when processing Open XML documents as mentioned by Eric White (who shows how to implement this behavior using an ordinary extension method). However, we'll look at one more interesting aspect of this implementation of grouping - the fact that it can work with infinite sequences.
Grouping prime numbers
An infinite sequence is an IEnumerator<T>
that always returns
true
when you call its MoveNext
method. They do not represent
elements of a collection (because an infinite collection wouldn't fit in a memory!), but
we can easily generate them in C# using the yield return
keyword.
The usual implementation of GroupBy
cannot work on infinite sequences,
because it needs to see all elements of the sequence, before it can give any result
(we can't return any group early, because there may still be some element that belongs
to that group in the rest of the sequence). However, our implementation which groups
only adjacent elements return a group immediately when the key calculated for the
current element changes. This means that it needs to see only a limited number of
elements before returning the next group.
Let's look at an example that shows how we can use this property in practice.
The following code shows how to generate an infinite IEnumerable<long>
value, which will contain prime numbers (it is still somehow limited, because we're
using long
, but we could use for example the new BigInteger
from .NET 4.0):
static bool IsPrime(long n) {
long max = (long)Math.Sqrt(n);
for (long i = 2; i <= max; i++)
if (n % i == 0) return false;
return true;
}
static IEnumerable<long> Primes() {
for (long n = 0; true; n++)
if (IsPrime(n)) yield return n;
}
Now, let's say that we want to count the number of primes in groups of 100000 numbers. It is a well known fact that for intervals of the same length (in our case, 100000) the number of primes will be larger for smaller numbers (e.g. interval from 0 to 100000) and smaller once we move to larger numbers (e.g. interval from 500000 to 600000). We can verify this fact using the following query:
var primeGroups =
from n in Primes().WithAdjacentGrouping()
group n by (n / 100000) into g
select g.Count();
If you take first 10 results of the query (for example using Take(10)
),
it will give the following numbers:
9594, 8392, 8013, 7863, 7678, 7560, 7445, 7408, 7323, 7224
If you wanted to take all results of the query (e.g. using just foreach
),
it would continue printing results forever (and get slower and slower, because testing
whether a large number is prime can take quite long). If you forget to add the
WithAdjacentGrouping
method and run the code using the ordinary
GroupBy
method, it will never print any result (because the standard
GroupBy
operator tries to read all numbers from the infinite sequence).
Ascending and descending groups
So far, we have been using group by
to group adjacent elements with the same key.
The standard LINQ implementation gives us group by
which groups all elements
using the given key. However, we can imagine other ways to group elements. For example,
you could create groups of values for which the value of the key is ascending or descending.
This can be used for example to simply show trends in a sequence of data.
Let's say we have a sequence (ordered by time) which contains stock prices (of a single stock) or for example currency exchange rate. The data may look like this:
{ { Price = 80.00, Time = 8:00 },
{ Price = 70.00, Time = 9:00 },
{ Price = 50.00, Time = 10:00 },
{ Price = 55.00, Time = 11:00 },
{ Price = 60.00, Time = 12:00 },
{ Price = 75.00, Time = 13:00 },
{ Price = 65.00, Time = 14:00 } }
Now, you can see that there is a descending sequence from 8:00 to 10:00, then we have an ascending sequence from 10:00 to 13:00 and finally, there is a short descending sequence from 13:00 to 14:00. If we want to simply visualize these trends, we would like to aggregate these three groups into the following result:
{ { Difference = -30.00, Interval = 2:00 },
{ Difference = +25.00, Interval = 3:00 },
{ Difference = -10.00, Interval = 1:00 } }
This operation is quite similar to the one which motivated this article, but it
is different. We cannot implement it using our WithAdjacentGrouping
method.
However, after reading this article, you'd be able to follow the pattern and define your
own extension method (for example WithTendencyGrouping
) and your own
custom implementation of GroupBy
, which implements the behavior we just
described. Then you could write the following code:
var trens =
from s in stocks.WithTendencyGrouping()
group s by s.Price into g
select new {
Difference = g.Last().Price - g.First().Price,
Interval = g.Last().Time - g.First().Time };
With an appropriate custom implementation of GroupBy
method, this code would
give the results we've seen earlier. As you can see, the group by
clause
can be customized to perform various useful tasks.
Summary
In this article, we discussed two alternative implementations of the GroupBy
operator (in addition to the standard implementation provided by LINQ). We've seen that
we can change the meaning of LINQ query syntax to use our own implementation of
GroupBy
. Then we can write readable and elegant LINQ queries to group
data in a non-standard way (for example by grouping adjacent elements with the same
value of the key). You could use the techniques discussed in this article to customize
other clauses of LINQ queries (for example the orderby
clause).
We didn't discuss one important thing. Once you use other operator in the LINQ query
(for example where
), the compiler "forgets" that we wanted to use custom
GroupBy
method. This happens because it selects the standard Where
method and gives it our wrapped sequence of type IAdjacentGrouping<T>
.
However, the Where
method returns the result of type IEnumerable<T>
,
so if the compiler needs to invoke GroupBy
on the result, it will pick a standard
overload. This problem can be easily solved by providing our own implementations for
all the standard LINQ query operators. For example, we would implement a Where
method,
which returns the result as IAdjacentGrouping<T>
. The implementations of these
methods would be quite easy (because they just wrap standard methods), but it would
make the code rather lengthfy, so I didn't include them in the article.
Finally, this article has been inspired by a question on StackOverflow
and by a paper by Philip Wadler and Simon Peyton Jones about Haskell comprehensions
(Haskell comprehensions were one of the motivations for LINQ). The paper adds support for order
and group
to the comprehension syntax (LINQ was one of the motivations of the paper). In Haskell, you can
provide your own function directly in the syntax. Here is a Haskell version of our original
example with grouping stock trades:
[ (the stock, length stock, average price)
| (stock, price) <- trades
, group by stock using groupRun ]
The important thing about the example that it uses using
clause which allows you to
specify your own grouping function. In this case, the grouping function is groupRun
,
which implements the same functionality as our custom GroupBy
grouping adjacent elements.
The paper also states that in LINQ, it isn't possible to provide custom grouping (or ordering)
function. I believe that this article shows that this isn't quite true - you can do similar thing
in LINQ. However, there are many limitaions in LINQ. In particular, we can specify one grouping
function for the whole query and cannot change it for each group by
clause.
Downloads & References
- [1] Can I use LINQ to retrieve only “on change” values? - Question at StackOverflow
- [2] The GroupAdjacent extension method - Eric White's blog
- [3] Comprehensive Comprehensions: Comprehensions with ‘Order by’ and ‘Group by’ - Philip Wadler, Simon Peyton Jones
Published: Sunday, 7 February 2010, 8:13 PM
Author: Tomas Petricek
Typos: Send me a pull request!
Tags: academic, c#, parallel