Suppose I want to split a Python iterable without listifying each chunk, similar to itertools.groupby, whose chunks are lazy. But I want to split on a more sophisticated condition than equality of a key; so, more like a parser.
For example, suppose I want to use odd numbers as delimiters in an iterable of integers, like more_itertools.split_at(xs, lambda x: x % 2 == 1). (But more_itertools.split_at listifies each chunk.)
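For concreteness, that call already produces the grouping I want, except that every chunk arrives as an already-built list:

from more_itertools import split_at

xs = [2, 4, 1, 6, 8, 3, 10]
print(list(split_at(xs, lambda x: x % 2 == 1)))
# -> [[2, 4], [6, 8], [10]]   (odd delimiters dropped, but each chunk is a list, not a lazy iterator)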
In parser-combinator language this might be called sepBy1(odd, many(even)). In Haskell there are the Parsec, pipes-parse and pipes-group libraries, which address this kind of problem. For instance, it would be sufficient and interesting to write an itertools.groupby-like version of groupsBy' from Pipes.Group (see here).
There could probably be some clever jiu-jitsu with itertools.groupby, perhaps applying itertools.pairwise, then itertools.groupby, and then going back to single elements.
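Even without pairwise, just keying groupby on the predicate itself gets part of the way there (a sketch, not a full solution): the chunks stay lazy, but the empty chunks that a faithful split_at would produce around consecutive or leading/trailing delimiters are silently lost:

from itertools import groupby

def split_at_approx(xs, pred):
    # Runs of delimiters and runs of non-delimiters alternate under groupby;
    # keep only the non-delimiter runs. Chunks stay lazy (groupby semantics),
    # but empty chunks around consecutive/leading/trailing delimiters vanish.
    return (grp for is_delim, grp in groupby(xs, key=pred) if not is_delim)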
I could write it myself as a generator, I suppose, but writing itertools.groupby in Python (below) is already pretty involved, and it isn't readily generalizable. It seems like there should be something more general: a relatively painless way of writing parsers and combinators for streams of whatever type.
# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()

    def __iter__(self):
        return self

    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))

    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)
Answer:
Here are a couple of simple iterator splitters, which I wrote in a fit of boredom. I don't think they're particularly profound, but perhaps they'll help in some way.
I didn't spend a lot of time thinking about useful interfaces, optimisations, or implementing multiple interacting sub-features. All of that stuff could be added, if desired.
These are basically modelled on itertools.groupby, whose interface could be considered a bit weird. It's a consequence of Python not really being a functional programming language: generators (and other objects which implement the iterator protocol) are stateful, and there is no facility for saving and restoring generator state. So the functions do return an iterator which successively generates sub-iterators, each of which produces values from the original iterator. But the sub-iterators share the underlying iterable (the one passed in to the original call), which means that when you advance the outer iterator, any unconsumed values in the current sub-iterator are discarded without notice.
There are (fairly expensive) ways to avoid discarding the values, but since the most obvious one (listifying) was ruled out from the start, I just went with the groupby interface, despite the awkwardness of accurately documenting the behaviour. It would be possible to wrap the inner iterators with itertools.tee in order to make the sub-iterators independent of each other, but that comes at a price similar to (or possibly slightly greater than) listifying. It still requires each sub-iterator to be fully generated before the next one is started, but it doesn't require the sub-iterator to be fully generated before you start using its values.
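A rough sketch of that tee idea, written in terms of the split_at generator defined further down: hand the consumer one copy of each sub-iterator and drain the other, so that values the consumer hasn't asked for yet end up in tee's buffer instead of being thrown away:

from collections import deque
from itertools import tee

def split_at_independent(iterable, pred=lambda x: x is None):
    # Wrap each sub-iterator from split_at (defined below): one tee copy goes
    # to the consumer, the other is drained before moving on, which advances
    # the shared iterator while keeping the drained values buffered in tee.
    for g in split_at(iterable, pred):
        for_consumer, for_flush = tee(g)
        yield for_consumer
        deque(for_flush, maxlen=0)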
For simplicity (according to me :-) ), I implemented these functions as generators rather than objects, as with itertools and more_itertools. The outer generator yields each successive sub-iterator and then collects and discards any remaining values from it before yielding the next one [Note 1]. I imagine that most of the time the sub-iterator will have been fully exhausted before the outer loop tries to flush it, so the additional call will be a bit wasteful, but it's simpler than the code you cite for itertools.groupby.
It's still necessary to communicate back from the sub-iterator the fact that the original iterator was exhausted, since that's not something you can ask an iterator about. I use a nonlocal declaration to share that state between the outer and inner generators. In some ways, maintaining state in an object, as itertools.groupby does, might be more flexible and might even be considered more Pythonic, but nonlocal worked for me.
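Stripped of the splitting logic, the shape of that pattern looks roughly like this toy fixed-size chunker (just an illustration of the outer/inner structure, not one of the functions below; like split_at, it produces a trailing empty chunk when the input length is an exact multiple of n):

from collections import deque

def chunks_of(iterable, n):
    it = iter(iterable)
    done = False
    def subiter():
        nonlocal done                   # flag read by the outer loop
        for i, value in enumerate(it):
            yield value
            if i + 1 == n:
                return                  # chunk complete; source not exhausted
        done = True                     # source exhausted; tell the outer loop
    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)              # flush whatever the consumer skipped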
I implemented more_itertools.split_at (without the maxsplit and keep_separator options) and what I think is the equivalent of Pipes.Group's groupsBy', renamed as split_between to indicate that it splits between two consecutive elements if they satisfy some condition.
Note that split_between always forces the first value from the supplied iterator before it has been requested by running the first sub-iterator. The rest of the values are generated lazily. I tried a few ways to defer that first object, but in the end I went with this design because it's a lot simpler. The consequence is that split_at, which doesn't do the initial force, always returns at least one sub-iterator, even if the supplied argument is empty, whereas split_between does not. I'd have to try both of these on some real problem in order to decide which interface I prefer; if you have a preference, by all means express it (but no guarantees about changes).
from collections import deque

def split_at(iterable, pred=lambda x: x is None):
    '''Produces an iterator which returns successive sub-iterations of
    `iterable`, delimited by values for which `pred` returns
    truthiness. The default predicate returns True only for the
    value None.

    The sub-iterations share the underlying iterable, so they are not
    independent of each other. Advancing the outer iterator will discard
    the rest of the current sub-iteration.

    The delimiting values are discarded.
    '''
    done = False
    iterable = iter(iterable)

    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value):
                return
            yield value
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)
def split_between(iterable, pred=lambda before, after: before + 1 != after):
    '''Produces an iterator which returns successive sub-iterations of
    `iterable`, delimited at points where calling `pred` on two
    consecutive values produces truthiness. The default predicate
    returns True when the two values are not consecutive, making it
    possible to split a sequence of integers into contiguous ranges.

    The sub-iterations share the underlying iterable, so they are not
    independent of each other. Advancing the outer iterator will discard
    the rest of the current sub-iteration.
    '''
    iterable = iter(iterable)
    try:
        before = next(iterable)
    except StopIteration:
        return
    done = False

    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return
        yield before
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)
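By way of illustration (the list() calls are only for display; the chunks themselves are generated lazily):

xs = [2, 4, 1, 6, 8, 3, 10]
print([list(g) for g in split_at(xs, lambda x: x % 2 == 1)])
# -> [[2, 4], [6, 8], [10]]       (odd values act as discarded delimiters)

print([list(g) for g in split_between([1, 2, 3, 7, 8, 11])])
# -> [[1, 2, 3], [7, 8], [11]]    (split where consecutive values stop being adjacent)

# The edge case discussed above: empty input.
print([list(g) for g in split_at([])])        # -> [[]]
print([list(g) for g in split_between([])])   # -> []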
Notes

1. collections.deque(g, maxlen=0) is, I believe, currently the most efficient way of discarding the remaining values of an iterator, although it looks a bit mysterious. Credits to more_itertools for pointing me at that solution, and at the related expression for counting the number of objects produced by a generator:

   cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0

   Although I don't mean to blame more_itertools for the above monstrosity. (They do it with an if statement, not a walrus.)
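For what it's worth, wrapped in a function the counting expression can be used like this (more_itertools exposes essentially this as ilen):

from collections import deque

def count_items(it):
    # Keep only the last (index, value) pair produced by enumerate;
    # its index is the number of items, or 0 if the iterator was empty.
    return cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0

print(count_items(x for x in range(10) if x % 2))   # -> 5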