How to iterate over two sorted lists in largest pairs order in Python-CodePudding

I have two sorted iterables, e.g.:

a = [C, B, A]
b = [3, 2, 1]

I want to generate a list of all possible pairs, combining the largest (lowest index) pairs first. Largest means all combinations of element < 1 first (for both iterables), then all combinations of index < 2, etc. This is NOT cartesian product. Desired result:

combine(a, b)
>> ((C, 3), (B, 3), (C, 2), (B, 2), (A, 3), (A, 2), (C, 1), (B, 1), (A, 1))

I looked at itertools.product(), but this only generates the cartesian product.

Since the inputs are iterables (not necessarily lists), I'd prefer a library function that can handle iterables. The iterables may be long (millions of elements), so it would be desirable to avoid storing them all in memory. For the same reason, generating cartesian product and sorting is undesirable.

The output should be a generator, so that it's not necessary to iterate over all the combinations if not all are required.

CodePudding user response：

If you aren't too picky about the order, per the comments, and the goal is just lazy generation that considers the inputs "evenly": here is a simple approach for two inputs. Simply track the elements that have been "seen" so far from each input, while iterating over the two inputs in round-robin fashion (also see the recipes in the documentation). With each new element, yield the pairs that it makes with elements from the other input.

To make that work, though, we need to be aware of which source we are pulling from in the round-robin iteration. We could modify the implementation of roundrobin so that it yields that information in some encoded form; but I think it will be easier for our purposes to just write the whole algorithm directly, taking inspiration from there.

Thus:

def lazy_pairs(a, b):
    a_seen, b_seen = [], []
    a_active, b_active = True, True
    a_iter, b_iter = iter(a), iter(b)
    while a_active or b_active:
        if a_active:
            try:
                a_next = next(a_iter)
            except StopIteration:
                a_active = False
            else:
                for b_old in b_seen:
                    yield (a_next, b_old)
                a_seen.append(a_next)
        if b_active:
            try:
                b_next = next(b_iter)
            except StopIteration:
                b_active = False
            else:
                for a_old in a_seen:
                    yield (a_old, b_next)
                b_seen.append(b_next)

Generalizing for more inputs is left as an exercise. (Note that you could fall back on itertools.product to find all the possibilities involving the element just taken from one input, and seen elements from all the others.)

CodePudding user response：

Using the roundrobin recipe that Karl mentioned (copied verbatim). I think this will be faster, since all the hard work is done in C code of various itertools.

from itertools import repeat, chain, cycle, islice

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

def pairs(a, b):
    aseen = []
    bseen = []
    def agen():
        for aa in a:
            aseen.append(aa)
            yield zip(repeat(aa), bseen)
    def bgen():
        for bb in b:
            bseen.append(bb)
            yield zip(aseen, repeat(bb))
    return chain.from_iterable(roundrobin(agen(), bgen()))

a = ['C', 'B', 'A']
b = [3, 2, 1]
print(*pairs(a, b))

Output (Try it online!):

('C', 3) ('B', 3) ('C', 2) ('B', 2) ('A', 3) ('A', 2) ('C', 1) ('B', 1) ('A', 1)

Benchmark with two iterables of 2000 elements each (Try it online!):

 50 ms   50 ms   50 ms  Kelly
241 ms  241 ms  242 ms  Karl

Alternatively, if the two iterables can be iterated multiple times, we don't need to save what we've seen (Try it online!):

def pairs(a, b):
    def agen():
        for i, x in enumerate(a):
            yield zip(repeat(x), islice(b, i))
    def bgen():
        for i, x in enumerate(b, 1):
            yield zip(islice(a, i), repeat(x))
    return chain.from_iterable(roundrobin(agen(), bgen()))

(Will add to the benchmark later... Should be a little slower than my first solution.)

An extreme map/itertools version of that (Try it online!):

def pairs(a, b):
    return chain.from_iterable(roundrobin(
        map(zip,
            map(repeat, a),
            map(islice, repeat(b), count())),
        map(zip,
            map(islice, repeat(a), count(1)),
            map(repeat, b))
    ))