Home > Enterprise >  Pandas index column name issue when reading csv
Pandas index column name issue when reading csv

Time:06-02

I am trying to read a dataframe from a CSV file where the header with column names includes a name on the index column.

Sample Data Code i've tried so far:

# Open CSV
file= '507_002117.csv'
print(file)
df = pd.read_csv(file, header=6)
df = df.iloc[:, :11]
#print(df.dtypes)
print(df.head(5))

print(df['AlongTrack'])

I return something like this:

                     Timestamp  AlongTrack  ...     Pitch  Roll
2022-03-31 21:38:30   0.000000    0.000000  ... -0.260261  0.93
2022-03-31 21:54:48   0.098491   -0.000042  ... -0.259386  0.93
2022-03-31 21:54:53   0.196983   -0.000086  ... -0.249085  0.93
2022-03-31 21:54:59   0.295475   -0.000117  ... -0.243554  0.92
2022-03-31 21:55:04   0.393967   -0.000172  ... -0.246343  0.92

as you can see the index is properly set as the timestamp, but the column names are all shifted over one.

I've tried using the documentation, Pandas Documentation on read_csv

It looks like index_col would be the ticket, but I can't seem to get it to work setting it at various values, but all it does is shift things over by one row for the index, but the labels are still wrong.

Can anyone point me in the right direction? Thank you

CodePudding user response:

I was able to recreate this situation by creating a CSV whose column name count is one less than the number of comma-delimited data columns. Maybe that is your situation. Like this:

sim_csv = io.StringIO('''Timestamp,AlongTrack,Pitch,Roll
2022-03-31 21:38:30,0.000000, 0.000000,-0.260261,0.93
2022-03-31 21:54:48,0.098491,-0.000042,-0.259386,0.93
2022-03-31 21:54:53,0.196983,-0.000086,-0.249085,0.93
2022-03-31 21:54:59,0.295475,-0.000117,-0.243554,0.92
2022-03-31 21:55:04,0.393967,-0.000172,-0.246343,0.92''')

df = pd.read_csv(sim_csv)
print(df)

                     Timestamp  AlongTrack     Pitch  Roll
2022-03-31 21:38:30   0.000000    0.000000 -0.260261  0.93
2022-03-31 21:54:48   0.098491   -0.000042 -0.259386  0.93
2022-03-31 21:54:53   0.196983   -0.000086 -0.249085  0.93
2022-03-31 21:54:59   0.295475   -0.000117 -0.243554  0.92
2022-03-31 21:55:04   0.393967   -0.000172 -0.246343  0.92

With the correct number fo column headers we are good-to-go:

sim_csv = io.StringIO('''Timestamp,AlongTrack,Pitch,Roll,xtra
2022-03-31 21:38:30,0.000000, 0.000000,-0.260261,0.93
2022-03-31 21:54:48,0.098491,-0.000042,-0.259386,0.93
2022-03-31 21:54:53,0.196983,-0.000086,-0.249085,0.93
2022-03-31 21:54:59,0.295475,-0.000117,-0.243554,0.92
2022-03-31 21:55:04,0.393967,-0.000172,-0.246343,0.92''')

df = pd.read_csv(sim_csv)
print(df)

             Timestamp  AlongTrack     Pitch      Roll  xtra
0  2022-03-31 21:38:30    0.000000  0.000000 -0.260261  0.93
1  2022-03-31 21:54:48    0.098491 -0.000042 -0.259386  0.93
2  2022-03-31 21:54:53    0.196983 -0.000086 -0.249085  0.93
3  2022-03-31 21:54:59    0.295475 -0.000117 -0.243554  0.92
4  2022-03-31 21:55:04    0.393967 -0.000172 -0.246343  0.92

Looks like the logic is that if the CSV is one column header short then it is assumed that the first column is an unnamed index column.

One way around this would be to manually specify names= with your intended column names including the one that is missing. Then use skiprows= to skip over the header to the data.

CodePudding user response:

Just add the following line of code:

df = pd.read_csv(sim_csv)
df = df.set_index(df.columns[0]).reset_index()
df

Result:

index Timestamp AlongTrack Pitch Roll
0 0.0 0.0 -0.260261 0.93
1 0.098491 -4.2e-05 -0.259386 0.93
2 0.196983 -8.6e-05 -0.249085 0.93
3 0.295475 -0.000117 -0.243554 0.92
4 0.393967 -0.000172 -0.246343 0.92

I've spent some of my free time on it, so please accept ✅ and ▲ this answer if it solved your problem, it motivates me

  • Related