Home > Back-end >  Pandas reading file with inconsistent tabs in header
Pandas reading file with inconsistent tabs in header

Time:05-28

I'm reading a data set where columns we're added to the historical files and I want to read the files consistently. The problem is the older files is missing a column don't have the correct number of tabs in the header which leads the first column being read as the index.

The bad.csv

Col1    Col2    Col3    Col4    Col5
6   2   3           
5   2   4           

A good.csv that loads correctly

Col1    Col2    Col3    Col4    Col5    Col6
6   2   3           
5   2   4           

I'm reading the csvs file df = pd.read_csv('bad.csv', sep='\t')

I can detect if a file is bad by looking at the index, how can i correct the bad files so that it loads without Col1 being part of the index? I tried df.shift(1, axis=1) but this doesn't include the index, which I can set after shifting but I'm worried this might create more issues. ex:

df = df.shift(1,axis=1)
df.Col1 = df.index

Is there a better way?

CodePudding user response:

According to the docs:

Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

So I make sure that every line is ended with \t

bad.csv:

col1    col2    col3    col4    col5
2   4   6   8   10
3   5   8   10  13
4   8   12  16  20  24
15  13  11  9   7   5
1   1   2   3   5   8

Then:

df = pd.read_csv('bad.csv', sep='\t', index_col=False)

outcome

   col1  col2  col3  col4  col5  Unnamed: 5
0     2     4     6     8    10         NaN
1     3     5     8    10    13         NaN
2     4     8    12    16    20        24.0
3    15    13    11     9     7         5.0
4     1     1     2     3     5         8.0
  • Related