I'm reading a data set where columns were added to the files over time, and I want to read all the files consistently. The problem is that the older files, which are missing a column, don't have the correct number of tabs in the header, which leads to the first column being read as the index.
bad.csv:
Col1 Col2 Col3 Col4 Col5
6 2 3
5 2 4
good.csv, which loads correctly:
Col1 Col2 Col3 Col4 Col5 Col6
6 2 3
5 2 4
I'm reading the files with df = pd.read_csv('bad.csv', sep='\t')
I can detect whether a file is bad by looking at the index. How can I correct the bad files so that they load without Col1 being part of the index? I tried df.shift(1, axis=1), but this doesn't include the index. I can set the index back into the first column after shifting, but I'm worried this might create more issues, e.g.:
df = df.shift(1,axis=1)
df.Col1 = df.index
Is there a better way?
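For concreteness, here is a minimal sketch of the failure mode and the shift-and-restore workaround described in the question. The file contents are inlined with io.StringIO and the numbers are invented for illustration; the assumed malformation is data lines that end with a trailing tab, so each row has one more field than the header:

```python
import io

import pandas as pd

# Five header names, but each data line ends with a trailing tab,
# so every row has six fields -- one more than the header.
bad = (
    "Col1\tCol2\tCol3\tCol4\tCol5\n"
    "6\t2\t3\t4\t5\t\n"
    "5\t2\t4\t6\t8\t\n"
)

# Default read: pandas promotes the first field of each row to the
# index, so the Col1 values (6 and 5) end up in df.index.
df = pd.read_csv(io.StringIO(bad), sep="\t")

# Shift every value one column to the right, then copy the index
# back into Col1 and restore a default integer index.
df = df.shift(1, axis=1)
df["Col1"] = df.index
df = df.reset_index(drop=True)
```

This recovers the intended layout for this shape of file, but it silently assumes the only extra field is the trailing empty one, which is why a read-time fix is preferable.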
CodePudding user response:
According to the docs:
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
So I make sure that every line ends with \t.
bad.csv:
col1 col2 col3 col4 col5
2 4 6 8 10
3 5 8 10 13
4 8 12 16 20 24
15 13 11 9 7 5
1 1 2 3 5 8
Then:
df = pd.read_csv('bad.csv', sep='\t', index_col=False)
Outcome:
col1 col2 col3 col4 col5 Unnamed: 5
0 2 4 6 8 10 NaN
1 3 5 8 10 13 NaN
2 4 8 12 16 20 24.0
3 15 13 11 9 7 5.0
4 1 1 2 3 5 8.0
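The index_col=False behaviour can be checked with a minimal reproduction (contents inlined via io.StringIO; the numbers are illustrative, with the malformation modelled as a trailing tab on each data line):

```python
import io

import pandas as pd

# Five header names; every data line ends with a trailing delimiter.
bad = (
    "col1\tcol2\tcol3\tcol4\tcol5\n"
    "2\t4\t6\t8\t10\t\n"
    "3\t5\t8\t10\t13\t\n"
)

# Default read: the extra field per row makes pandas use the
# first column as the index.
broken = pd.read_csv(io.StringIO(bad), sep="\t")

# index_col=False forces pandas to keep the first column as data,
# which is the documented handling for trailing-delimiter files.
fixed = pd.read_csv(io.StringIO(bad), sep="\t", index_col=False)
```

Here broken.index holds the col1 values, while fixed keeps them in the col1 column with a default integer index, so both old and new files can be loaded with the same call.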