Why do I inconsistently get a ValueError or IndexError when splitting a string and using the results?

I have some code that processes some input text by splitting it up:

text = get_data_from_internet() # or read it from a file, whatever
a, b, c = text.split('|')

Usually, this works fine, but occasionally I will get an error message that looks like

ValueError: not enough values to unpack (expected 3, got 1)

If I instead try to get a single result from the split, like so:

first = text.split()[0]

then similarly it seems to work sometimes, but other times I get

IndexError: list index out of range

What is going on? I assume it has something to do with the data, but how can I understand the problem and fix it?

This question is intended as a canonical for common debugging questions. It is meant to explain primarily what the error message means and specifically what about the input string causes the problem. Questions like this are usually not caused by a typo; they are asked by people who need something explained.

CodePudding user response：

The problem is that the result from .split does not have enough items in it. .split produces a list of strings, and the length of that list depends only on the input string, not on any surrounding code.

When you write a, b, c = text.split('|'), the .split method does not know that three values are expected; it returns one more item than there are |s in the string, and the unpacking then fails if that number is not exactly three. In this case, the error is ValueError, because there is something wrong with the value: the list doesn't have enough items. (A list with too many items also raises ValueError, with the message "too many values to unpack".)

When you write first = text.split()[0], the problem is with the index (the [0] in that code), which causes an IndexError. The cause is the same: the list doesn't have enough items. (For an empty list, even [0] cannot be used as an index.) See also: Does "IndexError: list index out of range" when trying to access the N'th item mean that my list has less than N items?
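
To see this concretely, here is a small sketch that tries both patterns on several inputs and reports what happens. The sample strings are made up for illustration; they stand in for whatever the real data contains.

```python
# Hypothetical sample inputs, chosen to cover the good case and the bad ones.
samples = ['x|y|z', 'x|y', 'no delimiter here', '']

results = []
for text in samples:
    parts = text.split('|')
    try:
        a, b, c = parts
        outcome = 'ok'
    except ValueError as exc:
        outcome = f'ValueError: {exc}'
    results.append((len(parts), outcome))
    print(repr(text), '->', len(parts), 'item(s),', outcome)

# With no delimiter, whitespace-only input gives an empty list,
# so even index 0 is out of range.
index_error = None
try:
    first = '   '.split()[0]
except IndexError as exc:
    index_error = str(exc)
    print('IndexError:', index_error)
```

Only the first sample has exactly two |s, so it is the only one that unpacks into three names.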

As a special note: when .split is called with a delimiter (like the first example), the resulting list will have at least one item in it. This happens even if the input is an empty string:

>>> ''.split(',')
['']

However, this is not the case when using .split without a delimiter (special-case behaviour to split on any sequence of whitespace):

>>> ''.split()
[]

To solve the problem:

1. Make sure you are using the right tools for the job. For example, if your input is a .csv file, you should not try to write the code yourself to split the data into cells. Instead, use the csv standard library module.
2. Carefully check the input to figure out where the problem is occurring. Code like this is usually inside a loop that handles a line of text at a time; check what lines appear in the data to cause the problem. You can do this by, for example, using exception handling and logging. See https://ericlippert.com/2014/03/05/how-to-debug-small-programs/ for more general advice on debugging problems.
3. Decide what should happen when the bad input appears. Often, it is appropriate to just skip a blank line in the input. Sometimes you can fill in dummy values for whatever is missing. Sometimes you should report your own exception, etc. It is up to you to think about the specific context of your code and decide on the right course of action.
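
As a rough illustration of the exception-handling-and-logging approach, here is a minimal sketch. The function name parse_lines, the three-field |-separated format, and the policy of skipping blank or malformed lines are all assumptions made for the example; your own code may need a different policy.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

def parse_lines(lines):
    """Return (a, b, c) tuples for well-formed lines, logging malformed ones."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # policy choice: silently skip blank lines
        try:
            a, b, c = line.split('|')
        except ValueError:
            # Log the offending line so the bad input can be inspected later.
            logger.warning('line %d has unexpected format: %r', line_number, line)
            continue
        records.append((a, b, c))
    return records

sample = ['1|2|3', '', 'oops', '4|5|6|7', '8|9|10']
print(parse_lines(sample))
# [('1', '2', '3'), ('8', '9', '10')]
```

The warnings show exactly which lines caused trouble (here, lines 3 and 4), which is usually the quickest way to find out what the bad data looks like.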

CodePudding user response：

Here are some techniques for solving the problem:

Avoiding `IndexError`

We can manually check that the output of .split is of the correct length:

# 'a,b' contains only two items
parts = 'a,b'.split(',')
# ['a', 'b']

# we want the third item (index 2)
n = 2

if len(parts) > n:
    out = parts[n]
else:
    out = None

# checking we have None
out is None
# True
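
If this check is needed in several places, it could be wrapped in a small function. The name item_or_default is made up for this sketch, and it deliberately handles only non-negative indices:

```python
def item_or_default(items, n, default=None):
    """Return items[n] if that index exists, otherwise default.

    Hypothetical helper; only non-negative indices are supported.
    """
    return items[n] if 0 <= n < len(items) else default

print(item_or_default('a,b'.split(','), 2))
# None
print(item_or_default('a,b,c'.split(','), 2))
# c
print(item_or_default(''.split(), 0, default='missing'))
# missing
```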

If you specifically want the first item of a list, but the list might be empty, you can use this trick:

out = ''.split()
# []

# get the first item or None
next(iter(out), None)
# None

See Python idiom to return first item or None for details.

Avoiding `ValueError`

Since we don't know in advance how many items will be returned by split, here is a way to pad them with a default value (None, in this example) and ignore any extra values:

from itertools import chain, repeat

a, b, c, d, *_ = chain('A,B'.split(','), repeat(None, 4))

print(a, b, c, d)
# A B None None

We use itertools.repeat to create as many None values as might be needed, and itertools.chain to append them to the .split result.

The extended unpacking technique absorbs any extra Nones into the _ variable (not a special name, just convention). This also ignores extra values from the input. For example:

a, b, c, *_ = 'A,B,C,D,E'.split(',')

print(a, b, c)
# A B C

We can use a variation of that technique to get only the last three elements:

*_, x, y, z = 'A,B,C,D,E'.split(',')

print(x, y, z)
# C D E

Or a combination of both leading and trailing items (note that only one starred name is allowed in the assignment target):

a, b, *_, z = 'A,B,C,D,E'.split(',')

print(a, b, z)
# A B E
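
The padding and truncation shown earlier in this section can also be combined into one reusable function. This is only a sketch: the name split_exactly is invented here, and it uses itertools.islice (not used above) to cut the padded sequence to the requested length.

```python
from itertools import chain, islice, repeat

def split_exactly(text, sep, n, default=None):
    """Split text on sep and return exactly n items.

    Missing items are filled with default; extra items are dropped.
    Hypothetical helper, shown only for illustration.
    """
    # repeat(default) is endless, so islice is what bounds the result.
    padded = chain(text.split(sep), repeat(default))
    return list(islice(padded, n))

a, b, c = split_exactly('A,B', ',', 3)
print(a, b, c)
# A B None

x, y = split_exactly('A,B,C,D', ',', 2)
print(x, y)
# A B
```

Because the result always has exactly n items, the unpacking on the left-hand side cannot raise ValueError.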
