Suppose I want to split a Python iterable without listifying each chunk, similar to itertools.groupby, whose chunks are lazy. But I want to split on a more sophisticated condition than equality of a key; so, more like a parser.
For example, suppose I want to use odd numbers as delimiters in an iterable of integers, like more_itertools.split_at(xs, lambda x: x % 2 == 1). (But more_itertools.split_at listifies each chunk.)
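For concreteness, that call already produces the grouping I want, except that every chunk arrives as an already-built list:

from more_itertools import split_at

xs = [2, 4, 1, 6, 8, 3, 10]
print(list(split_at(xs, lambda x: x % 2 == 1)))
# -> [[2, 4], [6, 8], [10]]   (odd delimiters dropped, but each chunk is a list, not a lazy iterator)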
In parser-combinator language this might be called sepBy1(odd, many(even)). In Haskell there are the Parsec, pipes-parse and pipes-group libraries, which address this kind of problem. For instance, it would be sufficient and interesting to write an itertools.groupby-like version of groupsBy' from Pipes.Group (see here).
There could probably be some clever jiu-jitsu with itertools.groupby, perhaps applying itertools.pairwise, then itertools.groupby, and then going back to single elements.
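Even without pairwise, just keying groupby on the predicate itself gets part of the way there (a sketch, not a full solution): the chunks stay lazy, but the empty chunks that a faithful split_at would produce around consecutive or leading/trailing delimiters are silently lost:

from itertools import groupby

def split_at_approx(xs, pred):
    # Runs of delimiters and runs of non-delimiters alternate under groupby;
    # keep only the non-delimiter runs. Chunks stay lazy (groupby semantics),
    # but empty chunks around consecutive/leading/trailing delimiters vanish.
    return (grp for is_delim, grp in groupby(xs, key=pred) if not is_delim)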
I could write it myself as a generator, I suppose, but writing itertools.groupby in Python (below) is already pretty involved, and it isn't readily generalizable. It seems like there should be something more general: a relatively painless way of writing parsers and combinators for streams of whatever type.
# From https://docs.python.org/3/library/itertools.html#itertools.groupby
# groupby() is roughly equivalent to:
class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()

    def __iter__(self):
        return self

    def __next__(self):
        self.id = object()
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey, self.id))

    def _grouper(self, tgtkey, id):
        while self.id is id and self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)
Answer:
Here are a couple of simple iterator splitters, which I wrote in a fit of boredom. I don't think they're particularly profound, but perhaps they'll help in some way.
I didn't spend a lot of time thinking about useful interfaces, optimisations, or implementing multiple interacting sub-features. All of that stuff could be added, if desired.
These are basically modelled on itertools.groupby, whose interface could be considered a bit weird. It's a consequence of Python not really being a functional programming language: generators (and other objects which implement the iterator protocol) are stateful, and there is no facility for saving and restoring generator state. So the functions do return an iterator which successively generates sub-iterators, each of which produces values from the original iterator. But the sub-iterators share the underlying iterable (the one passed in to the original call), which means that when you advance the outer iterator, any unconsumed values in the current sub-iterator are discarded without notice.
There are (fairly expensive) ways to avoid discarding the values, but since the most obvious one (listifying) was ruled out from the start, I just went with the groupby interface, despite the awkwardness of accurately documenting the behaviour. It would be possible to wrap the inner iterators with itertools.tee in order to make the sub-iterators independent of each other, but that comes at a price similar to (or possibly slightly greater than) listifying. It still requires each sub-iterator to be fully generated before the next one is started, but it doesn't require the sub-iterator to be fully generated before you start using its values.
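A rough sketch of that tee idea, written in terms of the split_at generator defined further down: hand the consumer one copy of each sub-iterator and drain the other, so that values the consumer hasn't asked for yet end up in tee's buffer instead of being thrown away:

from collections import deque
from itertools import tee

def split_at_independent(iterable, pred=lambda x: x is None):
    # Wrap each sub-iterator from split_at (defined below): one tee copy goes
    # to the consumer, the other is drained before moving on, which advances
    # the shared iterator while keeping the drained values buffered in tee.
    for g in split_at(iterable, pred):
        for_consumer, for_flush = tee(g)
        yield for_consumer
        deque(for_flush, maxlen=0)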
For simplicity (according to me :-) ), I implemented these functions as generators rather than objects, as with itertools and more_itertools. The outer generator yields each successive sub-iterator and then collects and discards any remaining values from it before yielding the next one [Note 1]. I imagine that most of the time the sub-iterator will have been fully exhausted before the outer loop tries to flush it, so the additional call will be a bit wasteful, but it's simpler than the code you cite for itertools.groupby.
It's still necessary to communicate back from the sub-iterator the fact that the original iterator was exhausted, since that's not something you can ask an iterator about. I use a nonlocal declaration to share that state between the outer and inner generators. In some ways, maintaining state in an object, as itertools.groupby does, might be more flexible and might even be considered more Pythonic, but nonlocal worked for me.
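Stripped of the splitting logic, the shape of that pattern looks roughly like this toy fixed-size chunker (just an illustration of the outer/inner structure, not one of the functions below; like split_at, it produces a trailing empty chunk when the input length is an exact multiple of n):

from collections import deque

def chunks_of(iterable, n):
    it = iter(iterable)
    done = False
    def subiter():
        nonlocal done                   # flag read by the outer loop
        for i, value in enumerate(it):
            yield value
            if i + 1 == n:
                return                  # chunk complete; source not exhausted
        done = True                     # source exhausted; tell the outer loop
    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)              # flush whatever the consumer skipped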
I implemented more_itertools.split_at (without the maxsplit and keep_separator options) and what I think is the equivalent of Pipes.Group's groupsBy', renamed as split_between to indicate that it splits between two consecutive elements if they satisfy some condition.
Note that split_between always forces the first value from the supplied iterator before it has been requested by running the first sub-iterator. The rest of the values are generated lazily. I tried a few ways to defer that first object, but in the end I went with this design because it's a lot simpler. The consequence is that split_at, which doesn't do the initial force, always returns at least one sub-iterator, even if the supplied argument is empty, whereas split_between does not. I'd have to try both of these on some real problem in order to decide which interface I prefer; if you have a preference, by all means express it (but no guarantees about changes).
from collections import deque

def split_at(iterable, pred=lambda x: x is None):
    '''Produces an iterator which returns successive sub-iterations of
    `iterable`, delimited by values for which `pred` returns
    truthiness. The default predicate returns True only for the
    value None.

    The sub-iterations share the underlying iterable, so they are not
    independent of each other. Advancing the outer iterator will discard
    the rest of the current sub-iteration.

    The delimiting values are discarded.
    '''
    done = False
    iterable = iter(iterable)

    def subiter():
        nonlocal done
        for value in iterable:
            if pred(value):
                return
            yield value
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)
def split_between(iterable, pred=lambda before, after: before + 1 != after):
    '''Produces an iterator which returns successive sub-iterations of
    `iterable`, delimited at points where calling `pred` on two
    consecutive values produces truthiness. The default predicate
    returns True when the two values are not consecutive, making it
    possible to split a sequence of integers into contiguous ranges.

    The sub-iterations share the underlying iterable, so they are not
    independent of each other. Advancing the outer iterator will discard
    the rest of the current sub-iteration.
    '''
    iterable = iter(iterable)
    try:
        before = next(iterable)
    except StopIteration:
        return
    done = False

    def subiter():
        nonlocal done, before
        for after in iterable:
            yield before
            prev, before = before, after
            if pred(prev, before):
                return
        yield before
        done = True

    while not done:
        yield (g := subiter())
        deque(g, maxlen=0)
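By way of illustration (the list() calls are only for display; the chunks themselves are generated lazily):

xs = [2, 4, 1, 6, 8, 3, 10]
print([list(g) for g in split_at(xs, lambda x: x % 2 == 1)])
# -> [[2, 4], [6, 8], [10]]       (odd values act as discarded delimiters)

print([list(g) for g in split_between([1, 2, 3, 7, 8, 11])])
# -> [[1, 2, 3], [7, 8], [11]]    (split where consecutive values stop being adjacent)

# The edge case discussed above: empty input.
print([list(g) for g in split_at([])])        # -> [[]]
print([list(g) for g in split_between([])])   # -> []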
Notes

1. collections.deque(g, maxlen=0) is, I believe, currently the most efficient way of discarding the remaining values of an iterator, although it looks a bit mysterious. Credits to more_itertools for pointing me at that solution, and at the related expression for counting the number of objects produced by a generator:

   cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0

   Although I don't mean to blame more_itertools for the above monstrosity. (They do it with an if statement, not a walrus.)
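For what it's worth, wrapped in a function the counting expression can be used like this (more_itertools exposes essentially this as ilen):

from collections import deque

def count_items(it):
    # Keep only the last (index, value) pair produced by enumerate;
    # its index is the number of items, or 0 if the iterator was empty.
    return cache[0][0] if (cache := deque(enumerate(it, 1), maxlen=1)) else 0

print(count_items(x for x in range(10) if x % 2))   # -> 5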