There are questions that deal with this issue in:
Read CSV into a dataFrame with varying row lengths using Pandas
How to read a no header csv with variable length csv using pandas
However changes to Pandas seem to be deprecating those solutions. If I run this:
import pandas as pd
csv = r"""x,y,z,t
1,1,1,1
2,2,2,2
3,3,3,3,3
4,4,4,4
5,5,5,5
6,6,6,6"""
f = open('foo.csv', 'w')
f.write(csv)
f.close()
df = pd.read_csv('foo.csv', engine='python', usecols=[0,1,2,3,4], names = [0,1,2,3,4])
print(df)
I get
0 1 2 3 4
0 x y z t NaN
1 1 1 1 1 NaN
2 2 2 2 2 NaN
3 3 3 3 3 3.0
4 4 4 4 4 NaN
5 5 5 5 5 NaN
6 6 6 6 6 NaN
But I get a warning saying FutureWarning: Defining usecols with out of bounds indices is deprecated and will raise a ParserError in a future version.
I get the warning if my command is
df = pd.read_csv('foo.csv', engine='python', usecols=[4], names = [4])
as well.
I use some instruments (oscilloscopes, spectrum analyzers) that output CSVs in this odd form, where sometimes critical information for analysis scripts is by itself in the column like that. How will one be able to create this dataframe once pandas
deprecates this use?
CodePudding user response:
Here is one simple way to do it with the help of csv and itertools modules from Python standard library:
import csv
from itertools import zip_longest
import pandas as pd
with open("foo.csv") as f:
data = [row for row in csv.reader(f)]
df = pd.DataFrame(
{i: item for i, item in enumerate(zip_longest(*data, fillvalue=pd.NA))}
)
print(df)
# Output
0 1 2 3 4
0 x y z t <NA>
1 1 1 1 1 <NA>
2 2 2 2 2 <NA>
3 3 3 3 3 3
4 4 4 4 4 <NA>
5 5 5 5 5 <NA>
6 6 6 6 6 <NA>
CodePudding user response:
In short, remove the usecols
argument and you should be good to go!
>>> import pandas as pd
>>> df = pd.read_csv("foo.csv", names = [0, 1, 2, 3, 4])
>>> print(df)
0 1 2 3 4
0 x y z t NaN
1 1 1 1 1 NaN
2 2 2 2 2 NaN
3 3 3 3 3 3.0
4 4 4 4 4 NaN
5 5 5 5 5 NaN
6 6 6 6 6 NaN
As an insight into why, in the simplest case, we would read the CSV with pandas
with something like the following:
import pandas as pd
df = pd.read_csv("foo.csv")
However, for a CSV file containing the given data
x,y,z,t
1,1,1,1
2,2,2,2
3,3,3,3,3
4,4,4,4
5,5,5,5
6,6,6,6
An attempt to read this file outright with pandas
will produce the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 4, saw 5
The reason for this is because pandas
is attempting to infer how many columns there are from the first row, which it is assuming is the CSV header. pandas
first infers that the CSV delimiter being used for this file is a comma, ","
, and then counts three such delimiters in the header row, from which it infers that there are four columns. It then proceeds to read the rest of the file line-by-line until it reaches line 4, where it suddenly sees four delimiters (five columns) instead of four and produces an error.
The usecols
solution circumvents this problem because it tells pandas
to expect five columns from the start instead of the inferred four. However, this is not the intended usage of usecols
: its purpose is to select which columns from the header to use by index. pandas
currently sees five columns because of what's given in the names
argument, overriding the inferred four columns, but it looks like in the future, that argument may populate the indices of the inferred header instead, thus the deprecation warning.
The key here is that the names
argument also tells pandas
how many columns to look for and what to call them, overriding whatever it would otherwise infer. If I give it a sixth name, it will add sixth column of NaN
values and just assume the delimiters are missing:
>>> pd.read_csv("foo.csv", names = [0, 1, 2, 3, 4, 5])
0 1 2 3 4 5
0 x y z t NaN NaN
1 1 1 1 1 NaN NaN
2 2 2 2 2 NaN NaN
3 3 3 3 3 3.0 NaN
4 4 4 4 4 NaN NaN
5 5 5 5 5 NaN NaN
6 6 6 6 6 NaN NaN
And if you wanted to skip the first row containing the default header, you could also throw in the skiprows = 1
argument:
>>> pd.read_csv("foo.csv", names = [0, 1, 2, 3, 4], skiprows = 1)
0 1 2 3 4
0 1 1 1 1 NaN
1 2 2 2 2 NaN
2 3 3 3 3 3.0
3 4 4 4 4 NaN
4 5 5 5 5 NaN
5 6 6 6 6 NaN