Home > Software engineering >  Count runs lazily in Python translate from Haskell
Count runs lazily in Python translate from Haskell

Time:10-23

I'm trying to write a generator function (or achieve the equivalent) which takes an iterable xs in Python and counts the "runs". (This is a problem in Thinking Functionally with Haskell by Bird, which I want to translate into Python using Python's laziness features.) So

list(iter(count_runs(['a', 'a', 'b', 'c', 'a', 'd', 'd'])))
# => [(2, 'a'), (1, 'b'), (1, c'), (1, 'a'), (2, 'd')]

In Haskell it's

countRuns :: [a] -> [(Int, a)]
countRuns [] = []
countRuns x:xs = (1   length us, x):countRuns vs
               where us, vs = span (==x) xs 

In Python, I'd like to write somethin like

from itertools import takewhile, dropwhile

def count_runs(xs):
  # get first element x of xs, if it exists
  us, vs = (takewhile(lambda y: y==x, xs), 
            dropwhile(lambda y: y==x, xs))
  yield (1   len(list(us)), x)
  yield from count_runs(vs)

But the problem is that vs is an iterator already, so I will run into trouble if I call takewhile and dropwhile on it in the next recursion. (When I call list(takewhile(..., xs)) in the next recursion, it will get rid of the first element of dropwhile(..., xs) as well, because they're both looking at the same iterator.

How can I fix this issue, and what is the correct way to get the first element in the second line?

CodePudding user response:

A significant difference between span and takewhile is that takewhile consumes the first non-x value in order to determine when to stop yielding values. As a result, you'll lose any singleton items in the input; in particular, takewhile loses the first b in producing the leading set of as. The iterator protocol has no way of peeking at the next element of an iterator, nor does it have a way to put back an element it consumes.

Instead, you'll need two independent iterators: one for takewhile to produce the desired prefix, and another from which to drop that prefix for the recursive call.

def count_runs(xs):
    try:
        x = next(xs)
    except StopIteration:
        return

    t1, t2 = tee(xs)

    us = list(takewhile(lambda y: y == x, t1))
    yield (1   len(us), x)
    yield from count_runs(dropwhile(lambda y: y == x, t2))

(Note that the itertools documentation implements something similar to span in its recipe section as a before_and_after function. It doesn't use tee, but I refer you to the actual implimentation for details.)

def before_and_after(xs):
    ...

def count_runs(xs):
    try:
        x = next(xs)
    except StopIteration:
        return

    first, second = before_and_after(lambda y: y == x, xs)
    yield (1   len(list(first)), x)
    yield from count_runs(second)

)

However, most of this work is already done for you by itertools.groupby.

def count_runs(xs):
    yield from ((len(list(v)), k) for k, v in groupby(xs))
  • Related