I am trying to read an unstructured CSV file without any header using Pandas. The number of columns differs from row to row and there is no clear upper limit on the number of columns. Right now it is 10, but it will increase to maybe 15.
Example CSV file content:
a;b;c
a;b;c;d;e;;;f
a;;
a;b;c;d;e;f;g;h;;i
a;b;
....
Here is how I read it using Python Pandas:
pd.DataFrame(pd.read_csv(path, sep=";", header=None,
                         usecols=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                         names=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'],
                         nrows=num_of_rows + 1))
However, this produces the following warning:
FutureWarning: Defining usecols with out of bounds indices is deprecated and will raise a ParserError in a future version.
I don't want my code to stop working in the future because of this. My question is: is there a way to read such an unstructured CSV file using Pandas (or any other equally fast library) in a future-safe way?
CodePudding user response:
You can use:
# use a separator that never occurs in the data, so each line is read as a single field
df = (pd.read_csv('data.csv', sep='@', header=None).squeeze()
.str.split(';', expand=True).fillna(''))
df.columns = [chr(65 + c) for c in df.columns]  # rename 0..9 to 'A'..'J', or whatever you want
print(df)
# Output
   A  B  C  D  E  F  G  H  I  J
0  a  b  c
1  a  b  c  d  e        f
2  a
3  a  b  c  d  e  f  g  h     i
4  a  b
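This works because '@' never occurs in the file, so read_csv returns each line as a single string; squeeze() turns the one-column DataFrame into a Series, and str.split(';', expand=True) then does the real splitting, padding shorter rows with NaN. For reference, a self-contained version of the same idea, with the question's sample rows in an io.StringIO in place of data.csv so it can be pasted and run as-is:

import io
import pandas as pd

data = """a;b;c
a;b;c;d;e;;;f
a;;
a;b;c;d;e;f;g;h;;i
a;b;"""

df = (pd.read_csv(io.StringIO(data), sep='@', header=None)  # '@' is absent, so each line is one field
        .squeeze('columns')                                 # one-column DataFrame -> Series of raw lines
        .str.split(';', expand=True)                        # split on ';'; shorter rows are padded with NaN
        .fillna(''))
df.columns = [chr(65 + c) for c in df.columns]              # rename 0..9 to 'A'..'J'
print(df)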
Update
Other possibility:
df = (pd.read_csv('data.csv', sep='@', header=None).squeeze()
.str.replace(r';{2,}', ';', regex=True)  # collapse runs of ';' so empty fields are dropped
.str.split(';', expand=True).fillna(''))
df.columns = [chr(65 + c) for c in df.columns]  # or whatever you want
print(df)
# Output
   A  B  C  D  E  F  G  H  I
0  a  b  c
1  a  b  c  d  e  f
2  a
3  a  b  c  d  e  f  g  h  i
4  a  b
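If the goal is just to avoid the deprecated usecols behaviour, another option (a sketch, not part of the answer above) is to keep sep=';' but drop usecols and let names fix the expected number of columns; pandas then pads rows that have fewer fields with NaN. The column labels below are only placeholders, and the list can be extended (e.g. to 15 names) as the file gains columns, as long as no row ever has more fields than names:

import io
import pandas as pd

data = """a;b;c
a;b;c;d;e;;;f
a;;
a;b;c;d;e;f;g;h;;i
a;b;"""

# names alone fixes the column count, so usecols is not needed and no FutureWarning is raised
cols = list('abcdefghij')   # extend this list as the number of columns grows
df = pd.read_csv(io.StringIO(data), sep=';', header=None, names=cols).fillna('')
print(df)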