Home > Mobile >  Forward-only one-pass stateful IEnumerable<T> to IEnumerable<IEnumerable<T>> with
Forward-only one-pass stateful IEnumerable<T> to IEnumerable<IEnumerable<T>> with

Time:11-09

I want to take streamed IEnumerable values such as: (Tuples shown for illustration purposes. The actual application will be streaming DataRecords from a DataReader)

var tuples = new(int, int)[]
{
    (0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2),
};

I want to maintain state and watch for each change in the left field. Each group of items with the left field in common should be returned as a series of IEnumerables (the data will be pre-sorted so I don't have to worry about collating on the local end).

(0,0) (0,1) (0,2) (0,3)
(1,0) (1,1)
(2,0) (2,1) (2,2)

It may not be possible, but the hope is to do this without creating any per-group interim Lists that would bank records in RAM because each group will be fairly large. In other words, somehow get each group to somehow magically siphon from the original IEnumerable such that it is completely forward-only, one-pass.

An inner-TakeWhile seemed like it might be the way to go, but it always restarts the iteration on tg from ground zero.

            private int currentGroup;

            public IEnumerator<IEnumerable<Tuple<int, int>>> GetEnumerator()
            {
                var tg = TupleGenerator();
                foreach (Tuple<int, int> item in tg)
                {
                    currentGroup = item.Item1;
                    yield return tg.TakeWhile((x) => x.Item1 == currentGroup);
                }
            }

            static IEnumerable<Tuple<int, int>> TupleGenerator()
            {
                for (int i = 0; i < 10; i  )
                {
                    for (int j = 0; j < 10; j  )
                    {
                        yield return new Tuple<int, int>(i,j);
                    }
                }
            }

CodePudding user response:

So while it is possible to avoid storing an entire groups worth of data in memory, and to only ever stream each item as you request it with only a constant amount of memory, there are downsides. It's important to realize why buffering is the typical solution to problems like this. First, the code to avoid buffering is just more complex. Second, it's essential that the consumer always iterate every inner IEnumerable to completion before ever requesting the next group, and that no inner IEnumerable ever be start being iterated more than one time. If you violate any of those rules (if you don't check for them explicitly) things will just silently be wrong. (You'll end up with data that should be in the same group in multiple groups, have the wrong number of groups, etc.) Given how easy those mistakes are to make, and the consequences of messing up, it's really worth explicitly checking for them and throwing an exception, so the consumer at least knows it's wrong and needs to be fixed.

As for the actual code, you really need to do a lot manually, to have direct control over the enumerator and when exactly you get items, but at it's core the code mostly just boils down to keeping track of the previous item and continuing to yield items into the same group as long as the current item "matches" with it.

public static IEnumerable<IEnumerable<T>> GroupWhile<T>(
    this IEnumerable<T> source,
    Func<T, T, bool> predicate)
{
    using (var iterator = source.GetEnumerator())
    {
        bool previousGroupFinished = true;
        bool sourceExhaused = !iterator.MoveNext();
        while (!sourceExhaused)
        {
            if (!previousGroupFinished)
                throw new InvalidOperationException("It is not valid to request the next group until the previous group has run to completion");
            previousGroupFinished = false;
            bool startedIteratingCurrentGroup = false;

            yield return NextGroup();

            IEnumerable<T> NextGroup()
            {
                if (startedIteratingCurrentGroup)
                    throw new InvalidOperationException("This sequence doesn't support being iterated multiple times.");
                startedIteratingCurrentGroup = true;

                T previous;
                do
                {
                    yield return iterator.Current;
                    previous = iterator.Current;
                    sourceExhaused = !iterator.MoveNext();
                }
                while (!sourceExhaused && predicate(previous, iterator.Current));
                previousGroupFinished = true;
            }
        }
    }
}

To use it in your case, it's pretty simple, your items are grouped while the first item in the pair is equal, but you can use any arbitrary condition that you want for grouping.

var data = new[] { (0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2) };
var grouped = data.GroupWhile((previous, current) => previous.Item1 == current.Item1);
foreach (var group in grouped)
{
    Console.WriteLine(String.Join(", ", group));
}

Rather than grouping on a predicate, it's also often convenient to group on some key object. In your case for your example, the key is just the first item in the tuple. But if computing the key is expensive or otherwise can't be computed multiple times, you can alter the grouping mechanism to have a key selector instead of a predicate, and to store the previous key rather than the previous item. It results in code that's fairly similar, but subtly different:

public static IEnumerable<IEnumerable<TSource>> GroupAdjacent<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> keyComparer = null)
{
    keyComparer = keyComparer ?? EqualityComparer<TKey>.Default;

    using (var iterator = source.GetEnumerator())
    {
        bool previousGroupFinished = true;
        bool sourceExhaused = !iterator.MoveNext();
        TKey nextKey = keySelector(iterator.Current);
        while (!sourceExhaused)
        {
            if (!previousGroupFinished)
                throw new InvalidOperationException("It is not valid to request the next group until the previous group has run to completion");
            previousGroupFinished = false;
            bool startedIteratingCurrentGroup = false;

            yield return NextGroup();

            IEnumerable<TSource> NextGroup()
            {
                if (startedIteratingCurrentGroup)
                    throw new InvalidOperationException("This sequence doesn't support being iterated multiple times.");
                startedIteratingCurrentGroup = true;

                TKey previousKey;
                do
                {
                    yield return iterator.Current;
                    sourceExhaused = !iterator.MoveNext();
                    previousKey = nextKey;
                    if (!sourceExhaused)
                        nextKey = keySelector(iterator.Current);
                }
                while (!sourceExhaused && keyComparer.Equals(previousKey, nextKey));
                previousGroupFinished = true;
            }
        }
    }
}

Allowing you to write:

var data = new[] { (0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2) };
var grouped = data.GroupAdjacent(pair => pair.Item1);
foreach (var group in grouped)
{
    Console.WriteLine(String.Join(", ", group));
}

CodePudding user response:

Using MoreLinq's GroupAdjacent method:

Groups the adjacent elements of a sequence according to a specified key selector function.

var groupings = GetValues().GroupAdjacent(tuple => tuple.Item1);

foreach (var grouping in groupings)
{
    Console.WriteLine($"Value: {grouping.Key}. Elements: {string.Join(", ", grouping)}");
}

IEnumerable<(int, int)> GetValues()
{
    yield return (0, 0);
    yield return (0, 1);
    yield return (0, 2);
    yield return (0, 3);
    yield return (1, 0);
    yield return (1, 1);
    yield return (2, 0);
    yield return (2, 1);
    yield return (2, 2);
};

This outputs the following:

Value: 0. Elements: (0, 0), (0, 1), (0, 2), (0, 3)

Value: 1. Elements: (1, 0), (1, 1)

Value: 2. Elements: (2, 0), (2, 1), (2, 2)

Each instance on the groupings enumerable is an IGrouping<int, (int, int)> which when enumerated results in what you need.

(P.S: the iterator is only hit once with this implementation, so it is completely forward-only)

  • Related