Understanding behaviour of custom Linq Chunk and IEnumerable<IEnumerable<T>>

Time:11-01

I tried to implement a custom LINQ Chunk function and found this code example. The function should split an IEnumerable into a sequence of IEnumerables of a given size.

public static class EnumerableExtentions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                yield return Batch();
            }
        }
    }
}

So I have a question: why are the results incorrect when I run a LINQ operation on the output? For example:

IEnumerable<int> list = Enumerable.Range(0, 10);
Console.WriteLine(list.Batch(2).Count()); // 10 instead of 5

I have an assumption that it happens because the inner IEnumerable Batch() is only triggered when Count() is called, and something goes wrong there, but I don't know what exactly.

CodePudding user response:

You have created an iterator inside an iterator, but Count() only executes the outer one. To execute the inner iterator you need to enumerate it, for example:

var batches = list.Batch(3);
foreach(var batch in batches) // the outer is executed
{
    int count = batch.Count(); // the inner iterator is executed now
}

Well, I would suggest a different approach for the Chunk method, like this:

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
{
    T[]? bucket = null;
    var count = 0;

    foreach (var item in source)
    {
        bucket ??= new T[size];
        bucket[count++] = item;

        if (count != size)
            continue;

        yield return bucket;

        bucket = null;
        count = 0;
    }

    if (count > 0)
    {
        Array.Resize(ref bucket, count);
        yield return bucket;
    }
}

CodePudding user response:

I have an assumption that it happens because the inner IEnumerable Batch() is only triggered when Count() is called

It's the opposite. The inner IEnumerable is not consumed when you call Count. Count only consumes the outer IEnumerable, which is this one:

while (enumerator.MoveNext())
{
    int i = 0;
    IEnumerable<T> Batch()
    {
        // the below is not executed by Count!
        // do yield return enumerator.Current;
        // while (++i < size && enumerator.MoveNext());
    }
    yield return Batch();
}

So all Count does is move the enumerator to the end and count how many times it moved, which is 10.
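To see this for yourself, here is a minimal sketch that adds a counter to the Batch from the question. The names LazyBatchDemo, CountingBatch and InnerRuns are made up for this illustration and are not part of the original code. The counter only increases when the body of an inner iterator actually starts running.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyBatchDemo
{
    // Hypothetical counter: how many times an inner iterator body started running.
    public static int InnerRuns;

    // Same shape as the Batch in the question, with the counter added.
    public static IEnumerable<IEnumerable<T>> CountingBatch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Inner()
                {
                    InnerRuns++; // reached only when this batch is enumerated
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                yield return Inner();
            }
        }
    }

    public static void Main()
    {
        var list = Enumerable.Range(0, 10);

        // Outer only: no inner body ever starts, so the outer loop sees every element.
        InnerRuns = 0;
        int outerOnly = list.CountingBatch(2).Count();
        Console.WriteLine($"{outerOnly} batches, inner ran {InnerRuns} times"); // 10 batches, inner ran 0 times

        // Outer and inner: each batch is fully consumed before the outer loop advances.
        InnerRuns = 0;
        int nested = 0;
        foreach (var batch in list.CountingBatch(2))
        {
            foreach (var item in batch) { }
            nested++;
        }
        Console.WriteLine($"{nested} batches, inner ran {InnerRuns} times"); // 5 batches, inner ran 5 times
    }
}
```

The first call never touches the inner iterators, which is why every element of the source ends up counted as its own "batch".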

Compare that to how the author of this code likely intended it to be used:

foreach (var batch in someEnumerable.Batch(2)) {
    foreach(var thing in batch) {
        // ...
    }
}

Here I'm also consuming the inner IEnumerables with an inner loop, which runs the code inside the inner Batch. That code yields the current element, moves the source enumerator forward, and yields the current element again before the ++i < size check fails. The outer loop then moves the enumerator forward again for the next iteration. That is how you get a "batch" of two elements.

Notice that the "enumerator" (which came from someEnumerable) in the previous paragraph is shared between the inner and outer IEnumerables. Consuming either the inner or the outer IEnumerable moves it. The sequence of steps described above only happens when you consume both in that specific order (each inner batch completely, before the outer moves on), and only then do you get proper batches.
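As a rough illustration of how fragile that is, the sketch below runs the same lazy batching twice: once consuming each batch completely, and once taking only the first element of each batch. SharedEnumeratorDemo and LazyBatch are names invented for this example; the body of LazyBatch is the code from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SharedEnumeratorDemo
{
    // The Batch from the question, renamed (hypothetical name) to avoid clashes.
    public static IEnumerable<IEnumerable<T>> LazyBatch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Inner()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                yield return Inner();
            }
        }
    }

    public static void Main()
    {
        var list = Enumerable.Range(0, 10);

        // Each batch is fully consumed before the outer iterator advances.
        var full = list.LazyBatch(2).Select(b => string.Join(",", b)).ToList();
        Console.WriteLine(string.Join(" | ", full)); // 0,1 | 2,3 | 4,5 | 6,7 | 8,9

        // Only the first element of each batch is read, so the inner iterator
        // never advances the shared enumerator and the outer one steps by one.
        var firsts = list.LazyBatch(2).Select(b => b.First()).ToList();
        Console.WriteLine(string.Join(",", firsts)); // 0,1,2,3,4,5,6,7,8,9
    }
}
```

The second result has ten entries instead of the five you might expect (0, 2, 4, 6, 8), because the batches only exist as a side effect of how they are consumed.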

In your case, you can consume the inner IEnumerables by calling ToList:

Console.WriteLine(list.Batch(2).Select(x => x.ToList()).Count()); // 5

While sharing the enumerator here allows the batches to be lazily consumed, it limits the client code to only consume it in very specific ways. In the .NET 6 implementation of Chunk, the batches (chunks) are eagerly computed as arrays:

public static IEnumerable<TSource[]> Chunk<TSource>(this IEnumerable<TSource> source, int size)

You can do a similar thing in your Batch by calling ToArray() here:

yield return Batch().ToArray();

so that the inner IEnumerables are always consumed.
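Put together, that change could look like the sketch below. EagerBatchDemo and EagerBatch are names I made up for the example; the only difference from the code in the question is the ToArray() call on the yield return line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EagerBatchDemo
{
    // The question's Batch with each inner sequence materialized before it is returned.
    public static IEnumerable<IEnumerable<T>> EagerBatch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Inner()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                // ToArray() runs the inner iterator right away, so the shared
                // enumerator is already past this batch when the caller sees it.
                yield return Inner().ToArray();
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(Enumerable.Range(0, 10).EagerBatch(2).Count()); // 5

        // A source whose length is not a multiple of the size ends with a shorter batch.
        var parts = Enumerable.Range(0, 5).EagerBatch(2).Select(b => string.Join(",", b));
        Console.WriteLine(string.Join(" | ", parts)); // 0,1 | 2,3 | 4
    }
}
```

The trade-off is that each batch is now held in memory as an array, but the caller can consume the result in any order.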

CodePudding user response:

Try it this way:

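public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> arr, int size)
{
  // Round up so that an exact multiple of size does not yield a trailing empty batch.
  // Note: Count(), Skip() and Take() each enumerate arr again from the start.
  var batchCount = (arr.Count() + size - 1) / size;
  for (var i = 0; i < batchCount; i++)
  {
    yield return arr.Skip(i * size).Take(size);
  }
}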