TP

Using custom grouping operators in LINQ

You can use LINQ to write queries that perform grouping of data using group by or ordering of data using orderby clause. LINQ provides the default (and the most common) implementation of both of the operations, but sometimes you may need a slightly different behavior when grouping or ordering data (this article is motivated by a question on StackOverflow which needs to do exactly that for grouping).

Let's look at a simple example, which shows when we may need a different behavior when grouping data. For example, we may have the following list of stock trades containing a name of a stock and the price of the trade (stored for example as a list of TradeInfo classes with properties Name and Price):

{ { Name = "MSFT", Price = 80.00 },
  { Name = "MSFT", Price = 70.00 },
  { Name = "GOOG", Price = 100.00 },
  { Name = "GOOG", Price = 200.00 },
  { Name = "GOOG", Price = 300.00 },
  { Name = "MSFT", Price = 30.00 },
  { Name = "MSFT", Price = 20.00 } }

Now, we may want to group adjacent trades into a single summary record which will contain the name of the stock (which is same for all trades in each group), the number of trades in the group and an average price in the group. The desired results are:

{ { Name = "MSFT", Count = 2, AvgPrice = 75.00 },
  { Name = "GOOG", Count = 3, AvgPrice = 200.00 },
  { Name = "MSFT", Count = 2, AvgPrice = 25.00 } }

The operation that we want to do is very similar to group by in LINQ, but it doesn't do quite the same thing! If we used group by, we would get only two groups as the result. However, as I wrote earlier, we want to group only adjacent trades. You could write your own extension method to do this, but then you need to leave the elegant LINQ query syntax. In this article, I'll show you how to get the desired results using a simple LINQ query with a group by clause...

Customizing LINQ queries

Let's start by looking how we can customize the meaning of LINQ queries. In fact, you may have already seen this - for example the AsParallel method from PLINQ does exactly that. Anyway, when you write a LINQ query, it is translated by the C# compiler to a sequence of method calls. The following query groups trades using the standard GroupBy operation provided by LINQ (so the result will be only two groups):

var agg = 
  from t in trades
  group t by t.Name into g
  select new { Name = g.Key, Count = g.Count(), 
               AvgPrice = g.Average(t => t.Price) };

Following the rules described in C# specification or using Reflector, we can see what the compiler does with the above query. The use of group by clause is translated in a call to GroupBy (an argument is a lambda expression that selects the key we're using for grouping) and the use of select is translated into a call to Select method (an argument is lambda expression returning the anonymous type):

var agg = 
  trades.GroupBy(t => t.Name).Select(g =>
    new { Name = g.Key, Count = g.Count(), 
          AvgPrice = g.Average(t => t.Price) });

This translation is done without checking whether the methods actually exist and what their type is. In this example, the type of trades is IEnumerable<TradeInfo>. When compiling the call to GroupBy, the compiler will first look for instance methods of the interface. It doesn't provide any GroupBy method, so it will try finding some extension method and it will use Enumerable.GroupBy, which is an extension method for the IEnumerable<T> type.

Now, what can we do if we want to use a different GroupBy method? We need to instruct the compiler to select a different extension method. We can implement a very simple method WithAdjacentGrouping which takes IEnumerable<T> and returns some other interface (we'll call it IAdjacentGrouping<T>). The implementation of the inteface is just a wrapper of IEnumerable<T>, but it means that C# compiler will use a different type when searching for the GroupBy method:

var agg = 
  trades.WithAdjacentGrouping()
        .GroupBy(...).Select(...);

We'll provide our own implementation of the GroupBy method, which groups only adjacent elements of the input sequence. The method will take an argument of type IAdjacentGrouping<T>, so when the compiler analyzes the code above, it will use our method instead of the standard one, which is available in the core libraries (LINQ to Objects). And of course, this will also work with the LINQ query syntax, because that is simply translated to method calls. We'll look at some nice queries shortly, but let's first implement the required interface and GroupBy method.

Implementing adjacent grouping

To implement all the machinery that allows us to use custom GroupBy method, we need to declare the IAdjacentGrouping<T> interface (including a concrete class implementing it) and we'll also need a class which implements the IGrouping<T> interface (which represents a group of elements with a Key). Once that's done, we'll need two extension methods - our customized GroupBy and a method that instructs the compiler to use it (WithAdjacentGrouping).

Interfaces and classes

Let's start with the IAdjacentGrouping<T> interface. We inherit it from IEnumerable<T>, which means that all other LINQ query operators (other than group by) will use the standard implementation. This will have some unfortunate consequences. All LINQ query operators return IEnumerable<T>, so if we use any other LINQ operator before our GroupBy, the query will "not remember" our non-standard grouping. This can be solved by providing our own implementations of other operators (and we'll discuss this in details later). Other than inheriting from IEnumerable<T> our new interface will not have any members, because we need it only to carry type information through the query:

interface IAdjacentGrouping<T> : IEnumerable<T> { 
} 

class WrappedAdjacentGrouping<T> : IAdjacentGrouping<T> { 
  public IEnumerable<T> Wrapped { get; set; } 
  
  public IEnumerator<T> GetEnumerator() {  
    return Wrapped.GetEnumerator();  
  } 
  IEnumerator IEnumerable.GetEnumerator() {
    return (IEnumerator)GetEnumerator();
  } 
}

The class WrappedAdjacentGrouping<T> is a simple implementation of our new interface. It wraps an IEnumerable<T> value and delegates all operations to the wrapped type, so this is pretty uninteresting boilerplate code.

We'll need one more trivial class. A typical grouping operation takes a list of elements and returns a list of lists (that is, a list of groups where each group consists of one or more inidividual elements). In LINQ, this is usually done by returning a value of type IEnumerable<IGrouping<TKey, T>>. The IGrouping<TKey, T> type is just like IEnumerable<T> with one additional feature - it has a property Key, which returns the key used to identify the group (in our earlier example, the key would be the name of the stock such as GOOG or MSFT). Since IGrouping<TKey, T> is an interface and .NET libraries don't provide any simple implementation of the interface, we'll need to write our own:

class Grouping<K, T> : IGrouping<K, T> {
  public K Key { get; set; }
  public IEnumerable<T> Elements;

  public IEnumerator<T> GetEnumerator() {
    return Elements.GetEnumerator();
  }
  IEnumerator IEnumerable.GetEnumerator() {
    return (IEnumerator)GetEnumerator();
  }
}

The class stores elements of the group in a property Elements, which will be usually accessed via the IEnumerable<T> interface (both generic and non-generic GetEnumerator methods just delegate the operation to the wrapped collection). The class also has a property Key with a getter (required by the interface) and setter (so that we can use it easily). Now that we have all the boilerplate code, let's look at more interesting things. In the next section, we'll implement our custom grouping.

Implementing custom grouping

Our custom GroupBy method has exactly the same type signature as the GroupBy method provided by LINQ (with the only exception that it takes the IAdjacentGrouping<T> as the first argument). It implements the behavior discussed in the introduction. Instead of grouping all elements into groups based on the returned keys, it groups all adjacent elements with the same key from the input collection.

The implementation of this functionality is the only lengthy piece of code in this article. We'll need to store a collection of elements in the current group (grouped so far) with a key of the current group. Each time we move to the next element, we'll check if it has the same key as the current group or not. If the key didn't change, we'll just add the element to the current group and continue. If the key changes, we'll return the previous group and start a new one. We also need to deal specially with the first element:

public static IEnumerable<IGrouping<K, T>> GroupBy<T, K>
    (this IAdjacentGrouping<T> source, Func<T, K> keySelector) where K : IComparable { 
  // Remembers elements of the current group
  List<T> elementsSoFar = new List<T>(); 
  IEnumerator<T> en = source.GetEnumerator(); 
  
  // Read the first element (we need an initial key value)
  if (en.MoveNext()) { 
    K lastKey = keySelector(en.Current); 
    do {  
      // Test whether current element starts a new group
      K newKey = keySelector(en.Current); 
      if (newKey.CompareTo(lastKey) != 0) 
      {  
        // Yield the previous group and start next one
        yield return new Grouping<K, T> 
          { Elements = elementsSoFar, Key = lastKey };
        lastKey = newKey;
        elementsSoFar = new List<T>(); 
      } 
      
      // Add element to the current group
      elementsSoFar.Add(en.Current); 
    } 
    while (en.MoveNext()); 
    
    // Yield the last group of sequence
    yield return new Grouping<K, T> 
      { Elements = elementsSoFar, Key = lastKey }; 
  }
} 

We're using the IEnumerator<T> to iterate over the source elements, because this allows us to call the MoveNext once before we start looping (to get the key of the first element, which is also the key of the first group). Once we initialize the lastKey variable, we start looping until the source is exhausted. Note that our method has a generic constraint saying that the key should be IComparable. This allows us to compare keys (and decide whether to start a new group or not) using the CompareTo method.

The last thing we need to do is to implement the WithAdjacentGrouping method, which instructs the compiler to use our GroupBy. As discussed earlier, the method will change the type of the collection from IEnumerable<T> to our type IAdjacentGrouping<T>, so that the compiler will prefer our GroupBy method (because it is an extension method directly for the IAdjacentGrouping<T> type):

public static IAdjacentGrouping<T> WithAdjacentGrouping<T>(this IEnumerable<T> e) { 
  return new WrappedAdjacentGrouping<T> { Wrapped = e }; 
} 

The extension method is trivial. It simply returns our concrete implementation of the interface, which wraps an IEnumerable<T>. This was the last missing piece that we needed to implement, before we could use our grouping operation in LINQ queries, so let's look at a couple of examples showing how this can be used.

Grouping trades and other examples

First of all, let's look at the example, which I presented as a motivation at the beginning of this article. We can use our new extension method WithAdjacentGrouping to change the meaning of a group by clause in a query. When we use the extension method, the query will group only adjacent elements, which is exactly what we wanted:

var groups =
  from t in trades.WithAdjacentGrouping()
  group t by t.Name into g
  select new {
    Name = g.Key, Count = g.Count(),
    AvgPrice = g.Average(t => t.Price) };

The query uses the value specified in the by clause to decide whether to start a new group or whether an element belongs to the current group. This means that it will start a new group each time the t.Name value changes. When that happens, it will use the select clause to generate aggregate information about the group. In this case, we return the number of elements and an average price in the group. If you run the query with the input data from the beginning of the article, you'll get the following result:

{ { Name = "MSFT", Count = 2, AvgPrice = 75.00 },
  { Name = "GOOG", Count = 3, AvgPrice = 200.00 },
  { Name = "MSFT", Count = 2, AvgPrice = 25.00 } }

I'm sure you can imagine other situations when this grouping technique would be useful. It also seems to be useful when processing Open XML documents as mentioned by Eric White (who shows how to implement this behavior using an ordinary extension method). However, we'll look at one more interesting aspect of this implementation of grouping - the fact that it can work with infinite sequences.

Grouping prime numbers

An infinite sequence is an IEnumerator<T> that always returns true when you call its MoveNext method. They do not represent elements of a collection (because an infinite collection wouldn't fit in a memory!), but we can easily generate them in C# using the yield return keyword.

The usual implementation of GroupBy cannot work on infinite sequences, because it needs to see all elements of the sequence, before it can give any result (we can't return any group early, because there may still be some element that belongs to that group in the rest of the sequence). However, our implementation which groups only adjacent elements return a group immediately when the key calculated for the current element changes. This means that it needs to see only a limited number of elements before returning the next group.

Let's look at an example that shows how we can use this property in practice. The following code shows how to generate an infinite IEnumerable<long> value, which will contain prime numbers (it is still somehow limited, because we're using long, but we could use for example the new BigInteger from .NET 4.0):

static bool IsPrime(long n) {
  long max = (long)Math.Sqrt(n);
  for (long i = 2; i <= max; i++)
    if (n % i == 0) return false;
  return true;
}
static IEnumerable<long> Primes() {
  for (long n = 0; true; n++)
    if (IsPrime(n)) yield return n;
}

Now, let's say that we want to count the number of primes in groups of 100000 numbers. It is a well known fact that for intervals of the same length (in our case, 100000) the number of primes will be larger for smaller numbers (e.g. interval from 0 to 100000) and smaller once we move to larger numbers (e.g. interval from 500000 to 600000). We can verify this fact using the following query:

var primeGroups =
  from n in Primes().WithAdjacentGrouping()
  group n by (n / 100000) into g
  select g.Count();

If you take first 10 results of the query (for example using Take(10)), it will give the following numbers:

9594, 8392, 8013, 7863, 7678, 7560, 7445, 7408, 7323, 7224

If you wanted to take all results of the query (e.g. using just foreach), it would continue printing results forever (and get slower and slower, because testing whether a large number is prime can take quite long). If you forget to add the WithAdjacentGrouping method and run the code using the ordinary GroupBy method, it will never print any result (because the standard GroupBy operator tries to read all numbers from the infinite sequence).

Ascending and descending groups

So far, we have been using group by to group adjacent elements with the same key. The standard LINQ implementation gives us group by which groups all elements using the given key. However, we can imagine other ways to group elements. For example, you could create groups of values for which the value of the key is ascending or descending. This can be used for example to simply show trends in a sequence of data.

Let's say we have a sequence (ordered by time) which contains stock prices (of a single stock) or for example currency exchange rate. The data may look like this:

{ { Price = 80.00, Time = 8:00 },
  { Price = 70.00, Time = 9:00 },
  { Price = 50.00, Time = 10:00 },
  { Price = 55.00, Time = 11:00 },
  { Price = 60.00, Time = 12:00 },
  { Price = 75.00, Time = 13:00 },
  { Price = 65.00, Time = 14:00 } }

Now, you can see that there is a descending sequence from 8:00 to 10:00, then we have an ascending sequence from 10:00 to 13:00 and finally, there is a short descending sequence from 13:00 to 14:00. If we want to simply visualize these trends, we would like to aggregate these three groups into the following result:

{ { Difference = -30.00, Interval = 2:00 },
  { Difference = +25.00, Interval = 3:00 },
  { Difference = -10.00, Interval = 1:00 } }

This operation is quite similar to the one which motivated this article, but it is different. We cannot implement it using our WithAdjacentGrouping method. However, after reading this article, you'd be able to follow the pattern and define your own extension method (for example WithTendencyGrouping) and your own custom implementation of GroupBy, which implements the behavior we just described. Then you could write the following code:

var trens = 
  from s in stocks.WithTendencyGrouping()
  group s by s.Price into g
  select new {  
    Difference = g.Last().Price - g.First().Price,
    Interval = g.Last().Time - g.First().Time };

With an appropriate custom implementation of GroupBy method, this code would give the results we've seen earlier. As you can see, the group by clause can be customized to perform various useful tasks.

Summary

In this article, we discussed two alternative implementations of the GroupBy operator (in addition to the standard implementation provided by LINQ). We've seen that we can change the meaning of LINQ query syntax to use our own implementation of GroupBy. Then we can write readable and elegant LINQ queries to group data in a non-standard way (for example by grouping adjacent elements with the same value of the key). You could use the techniques discussed in this article to customize other clauses of LINQ queries (for example the orderby clause).

We didn't discuss one important thing. Once you use other operator in the LINQ query (for example where), the compiler "forgets" that we wanted to use custom GroupBy method. This happens because it selects the standard Where method and gives it our wrapped sequence of type IAdjacentGrouping<T>. However, the Where method returns the result of type IEnumerable<T>, so if the compiler needs to invoke GroupBy on the result, it will pick a standard overload. This problem can be easily solved by providing our own implementations for all the standard LINQ query operators. For example, we would implement a Where method, which returns the result as IAdjacentGrouping<T>. The implementations of these methods would be quite easy (because they just wrap standard methods), but it would make the code rather lengthfy, so I didn't include them in the article.

Finally, this article has been inspired by a question on StackOverflow and by a paper by Philip Wadler and Simon Peyton Jones about Haskell comprehensions (Haskell comprehensions were one of the motivations for LINQ). The paper adds support for order and group to the comprehension syntax (LINQ was one of the motivations of the paper). In Haskell, you can provide your own function directly in the syntax. Here is a Haskell version of our original example with grouping stock trades:

[ (the stock, length stock, average price)
| (stock, price) <- trades
, group by stock using groupRun ]

The important thing about the example that it uses using clause which allows you to specify your own grouping function. In this case, the grouping function is groupRun, which implements the same functionality as our custom GroupBy grouping adjacent elements. The paper also states that in LINQ, it isn't possible to provide custom grouping (or ordering) function. I believe that this article shows that this isn't quite true - you can do similar thing in LINQ. However, there are many limitaions in LINQ. In particular, we can specify one grouping function for the whole query and cannot change it for each group by clause.

Downloads & References

Published: Sunday, 7 February 2010, 8:13 PM
Author: Tomas Petricek
Typos: Send me a pull request!
Tags: academic, c#, parallel