python `zip` builtin behavior unclear in documentation

Time:10-14

I'm well aware of the warning that's provided in Python's documentation for zip:

One thing to consider is that the iterables passed to zip() could have different lengths; sometimes by design, and sometimes because of a bug in the code that prepared these iterables. Python offers three different approaches to dealing with this issue:

    By default, zip() stops when the shortest iterable is exhausted. It will ignore the remaining items in the longer iterables, cutting off the result to the length of the shortest iterable:

    >>> list(zip(range(3), ['fee', 'fi', 'fo', 'fum']))
    [(0, 'fee'), (1, 'fi'), (2, 'fo')]

I'm also aware of the suggested strict argument, which makes zip raise an error if the supplied iterables aren't of equal length.

There's even an explicit warning about the perils of iterables of different lengths towards the end of the zip documentation:

Without the strict=True argument, any bug that results in iterables of different lengths will be silenced, possibly manifesting as a hard-to-find bug in another part of the program.

What wasn't clear to me from reading the docs, though, is that how many elements zip consumes also depends on the order in which the iterables are supplied. To rephrase: if a longer iterator is passed before a shorter one, zip pulls one extra element from it and then discards that element.

Consider the following:

    r = range(3)
    it = iter(r)
    list(zip('12', it))
    print(f'list(it) => {list(it)} <= happy case, this is expected')
    it = iter(r)
    list(zip(it, '12'))
    print(f'list(it) => {list(it)} <= sad case, an element is consumed from the iterator')

Output:

list(it) => [2] <= happy case, this is expected
list(it) => [] <= sad case, an element is consumed from the iterator

This behavior made sense once I thought about it for a bit. Within CPython's source there is a loop in zip_next that consumes one element from each of the supplied iterators in turn. If a later iterator turns out to be exhausted, the elements already consumed from the earlier iterators in that round are discarded.

        for (i=0 ; i < tuplesize ; i++) {
            it = PyTuple_GET_ITEM(lz->ittuple, i);
            item = (*Py_TYPE(it)->tp_iternext)(it);
            if (item == NULL) {
                Py_DECREF(result);
                if (lz->strict) {
                    goto check;
                }
                return NULL;
            }
            olditem = PyTuple_GET_ITEM(result, i);
            PyTuple_SET_ITEM(result, i, item);
            Py_DECREF(olditem);
        }
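
The same loop can be approximated in pure Python. This is only a rough sketch of the non-strict path, not CPython's actual implementation, and zip_sketch is a made-up name:

```python
def zip_sketch(*iterables):
    """Rough pure-Python approximation of non-strict zip (illustration only)."""
    iterators = [iter(it) for it in iterables]
    if not iterators:
        return
    while True:
        items = []
        for it in iterators:
            try:
                # Iterators are advanced strictly left to right.
                items.append(next(it))
            except StopIteration:
                # Anything already collected in `items` during this round
                # has been consumed from its iterator and is dropped here.
                return
        yield tuple(items)


it = iter(range(3))
list(zip_sketch(it, '12'))
leftover_long_first = list(it)   # the third element was consumed and dropped

it = iter(range(3))
list(zip_sketch('12', it))
leftover_short_first = list(it)  # the third element is still available
```

With the longer iterator first, the round that discovers the exhausted string has already pulled an element from it; with the shorter one first, the loop returns before touching it.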

I guess the main piece of the documentation that I take exception to is this:

It will ignore the remaining items in the longer iterables, cutting off the result to the length of the shortest iterable

The only way this could be strictly true is if zip had some internal mechanism for peeking at the iterables, which isn't the case.

So the question is: should this be considered a bug in zip's

  1. behavior
  2. documentation
  3. neither, you were already warned, but it would be nice for this case to be included in the docs
  4. go home, you're being too pedantic

Answer:

Indeed, this is (3) and (4). Besides, it actually works as expected: one does not expect magic, but one does expect that zip will internally fetch elements from the iterators in the order they are given.
