Home > Software engineering >  Pandas - Read in CSV with variable row length
Pandas - Read in CSV with variable row length

Time:03-13

There are questions that deal with this issue in:

Read CSV into a dataFrame with varying row lengths using Pandas

How to read a no header csv with variable length csv using pandas

However changes to Pandas seem to be deprecating those solutions. If I run this:

import pandas as pd

csv = r"""x,y,z,t
  1,1,1,1
  2,2,2,2
  3,3,3,3,3
  4,4,4,4
  5,5,5,5
  6,6,6,6"""

f = open('foo.csv', 'w')
f.write(csv)
f.close()

df = pd.read_csv('foo.csv', engine='python', usecols=[0,1,2,3,4], names = [0,1,2,3,4])

print(df)

I get

      0  1  2  3    4
0     x  y  z  t  NaN
1     1  1  1  1  NaN
2     2  2  2  2  NaN
3     3  3  3  3  3.0
4     4  4  4  4  NaN
5     5  5  5  5  NaN
6     6  6  6  6  NaN

But I get a warning saying FutureWarning: Defining usecols with out of bounds indices is deprecated and will raise a ParserError in a future version. I get the warning if my command is

df = pd.read_csv('foo.csv', engine='python', usecols=[4], names = [4]) as well.

I use some instruments (oscilloscopes, spectrum analyzers) that output CSVs in this odd form, where sometimes critical information for analysis scripts is by itself in the column like that. How will one be able to create this dataframe once pandas deprecates this use?

CodePudding user response:

Here is one simple way to do it with the help of csv and itertools modules from Python standard library:

import csv
from itertools import zip_longest

import pandas as pd


with open("foo.csv") as f:
    data = [row for row in csv.reader(f)]

df = pd.DataFrame(
    {i: item for i, item in enumerate(zip_longest(*data, fillvalue=pd.NA))}
)

print(df)
# Output
     0  1  2  3     4
0    x  y  z  t  <NA>
1    1  1  1  1  <NA>
2    2  2  2  2  <NA>
3    3  3  3  3     3
4    4  4  4  4  <NA>
5    5  5  5  5  <NA>
6    6  6  6  6  <NA>

CodePudding user response:

In short, remove the usecols argument and you should be good to go!

>>> import pandas as pd
>>> df = pd.read_csv("foo.csv", names = [0, 1, 2, 3, 4])
>>> print(df)
   0  1  2  3    4
0  x  y  z  t  NaN
1  1  1  1  1  NaN
2  2  2  2  2  NaN
3  3  3  3  3  3.0
4  4  4  4  4  NaN
5  5  5  5  5  NaN
6  6  6  6  6  NaN

As an insight into why, in the simplest case, we would read the CSV with pandas with something like the following:

import pandas as pd

df = pd.read_csv("foo.csv")

However, for a CSV file containing the given data

x,y,z,t
1,1,1,1
2,2,2,2
3,3,3,3,3
4,4,4,4
5,5,5,5
6,6,6,6

An attempt to read this file outright with pandas will produce the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 4, saw 5

The reason for this is because pandas is attempting to infer how many columns there are from the first row, which it is assuming is the CSV header. pandas first infers that the CSV delimiter being used for this file is a comma, ",", and then counts three such delimiters in the header row, from which it infers that there are four columns. It then proceeds to read the rest of the file line-by-line until it reaches line 4, where it suddenly sees four delimiters (five columns) instead of four and produces an error.

The usecols solution circumvents this problem because it tells pandas to expect five columns from the start instead of the inferred four. However, this is not the intended usage of usecols: its purpose is to select which columns from the header to use by index. pandas currently sees five columns because of what's given in the names argument, overriding the inferred four columns, but it looks like in the future, that argument may populate the indices of the inferred header instead, thus the deprecation warning.

The key here is that the names argument also tells pandas how many columns to look for and what to call them, overriding whatever it would otherwise infer. If I give it a sixth name, it will add sixth column of NaN values and just assume the delimiters are missing:

>>> pd.read_csv("foo.csv", names = [0, 1, 2, 3, 4, 5])
   0  1  2  3    4   5
0  x  y  z  t  NaN NaN
1  1  1  1  1  NaN NaN
2  2  2  2  2  NaN NaN
3  3  3  3  3  3.0 NaN
4  4  4  4  4  NaN NaN
5  5  5  5  5  NaN NaN
6  6  6  6  6  NaN NaN

And if you wanted to skip the first row containing the default header, you could also throw in the skiprows = 1 argument:

>>> pd.read_csv("foo.csv", names = [0, 1, 2, 3, 4], skiprows = 1)
   0  1  2  3    4
0  1  1  1  1  NaN
1  2  2  2  2  NaN
2  3  3  3  3  3.0
3  4  4  4  4  NaN
4  5  5  5  5  NaN
5  6  6  6  6  NaN
  • Related