I am trying to read a dataframe from a CSV file where the header with column names includes a name on the index column.
Sample Data Code i've tried so far:
# Open CSV
file= '507_002117.csv'
print(file)
df = pd.read_csv(file, header=6)
df = df.iloc[:, :11]
#print(df.dtypes)
print(df.head(5))
print(df['AlongTrack'])
I return something like this:
Timestamp AlongTrack ... Pitch Roll
2022-03-31 21:38:30 0.000000 0.000000 ... -0.260261 0.93
2022-03-31 21:54:48 0.098491 -0.000042 ... -0.259386 0.93
2022-03-31 21:54:53 0.196983 -0.000086 ... -0.249085 0.93
2022-03-31 21:54:59 0.295475 -0.000117 ... -0.243554 0.92
2022-03-31 21:55:04 0.393967 -0.000172 ... -0.246343 0.92
as you can see the index is properly set as the timestamp, but the column names are all shifted over one.
I've tried using the documentation, Pandas Documentation on read_csv
It looks like index_col would be the ticket, but I can't seem to get it to work setting it at various values, but all it does is shift things over by one row for the index, but the labels are still wrong.
Can anyone point me in the right direction? Thank you
CodePudding user response:
I was able to recreate this situation by creating a CSV whose column name count is one less than the number of comma-delimited data columns. Maybe that is your situation. Like this:
sim_csv = io.StringIO('''Timestamp,AlongTrack,Pitch,Roll
2022-03-31 21:38:30,0.000000, 0.000000,-0.260261,0.93
2022-03-31 21:54:48,0.098491,-0.000042,-0.259386,0.93
2022-03-31 21:54:53,0.196983,-0.000086,-0.249085,0.93
2022-03-31 21:54:59,0.295475,-0.000117,-0.243554,0.92
2022-03-31 21:55:04,0.393967,-0.000172,-0.246343,0.92''')
df = pd.read_csv(sim_csv)
print(df)
Timestamp AlongTrack Pitch Roll
2022-03-31 21:38:30 0.000000 0.000000 -0.260261 0.93
2022-03-31 21:54:48 0.098491 -0.000042 -0.259386 0.93
2022-03-31 21:54:53 0.196983 -0.000086 -0.249085 0.93
2022-03-31 21:54:59 0.295475 -0.000117 -0.243554 0.92
2022-03-31 21:55:04 0.393967 -0.000172 -0.246343 0.92
With the correct number fo column headers we are good-to-go:
sim_csv = io.StringIO('''Timestamp,AlongTrack,Pitch,Roll,xtra
2022-03-31 21:38:30,0.000000, 0.000000,-0.260261,0.93
2022-03-31 21:54:48,0.098491,-0.000042,-0.259386,0.93
2022-03-31 21:54:53,0.196983,-0.000086,-0.249085,0.93
2022-03-31 21:54:59,0.295475,-0.000117,-0.243554,0.92
2022-03-31 21:55:04,0.393967,-0.000172,-0.246343,0.92''')
df = pd.read_csv(sim_csv)
print(df)
Timestamp AlongTrack Pitch Roll xtra
0 2022-03-31 21:38:30 0.000000 0.000000 -0.260261 0.93
1 2022-03-31 21:54:48 0.098491 -0.000042 -0.259386 0.93
2 2022-03-31 21:54:53 0.196983 -0.000086 -0.249085 0.93
3 2022-03-31 21:54:59 0.295475 -0.000117 -0.243554 0.92
4 2022-03-31 21:55:04 0.393967 -0.000172 -0.246343 0.92
Looks like the logic is that if the CSV is one column header short then it is assumed that the first column is an unnamed index column.
One way around this would be to manually specify names=
with your intended column names including the one that is missing. Then use skiprows=
to skip over the header to the data.
CodePudding user response:
Just add the following line of code:
df = pd.read_csv(sim_csv)
df = df.set_index(df.columns[0]).reset_index()
df
Result:
index | Timestamp | AlongTrack | Pitch | Roll |
---|---|---|---|---|
0 | 0.0 | 0.0 | -0.260261 | 0.93 |
1 | 0.098491 | -4.2e-05 | -0.259386 | 0.93 |
2 | 0.196983 | -8.6e-05 | -0.249085 | 0.93 |
3 | 0.295475 | -0.000117 | -0.243554 | 0.92 |
4 | 0.393967 | -0.000172 | -0.246343 | 0.92 |
I've spent some of my free time on it, so please accept ✅ and ▲ this answer if it solved your problem, it motivates me