I have a raw dataframe that looks like
codcet placa_encoded date time_seconds velocidade
5031 490191222 431.0 2021-03-11 70079.0 51
5032 490221211 431.0 2021-03-11 72219.0 55
7991 490361213 562.0 2021-03-11 28559.0 24
7992 490361232 562.0 2021-03-11 29102.0 29
7993 490361221 562.0 2021-03-11 30183.0 33
...
Where the numbers on the far left are indexes from the original dataset.
My goal is to convert this into a dataframe indexed by placa_encoded
and by n
, a counter within each group that then looks like
placa_encoded n time_seconds velocidade codcet
431.0 0 70079.0 51 490191222
431.0 1 72219.0 55 490221211
562.0 0 28559.0 24 490361213
562.0 1 29102.0 29 490361232
562.0 2 30183.0 33 490361221
That is, I aim to groupby('placa_encoded')
then add another column n
that counts the position within each group. The row should be indexed by both placa_encoded
and n
. I think I can use cumcount()
to do this but it's unclear to me how to add it as a column since groupby
doesn't product a dataframe I can assign to. I looked at this question but it seems they use .count()
to convert it to a dataframe, and I want to preserve the data instead of getting any counts. I also tried to use pd.DataFrame(gbplaca)
and pd.DataFrame(gbplaca.groups)
to no avail.
Thank you so much!
CodePudding user response:
I think this is what you are wanting
df['n'] = df.sort_values(['time_seconds'], ascending=True).groupby(['placa_encoded']).cumcount()
df = df.set_index(['placa_encoded', 'n'])
df
This is a multi-index with 'placa_encoded' and 'n'
CodePudding user response:
All you have to do is to set the column in df
(named 'n'
) with the result from df.groupby('placa_encoded').cumcount()
:
df.loc[:, 'n'] = df.groupby('placa_encoded').cumcount()
print(df)
codcet placa_encoded date time_seconds velocidade n
0 490191222 431.0 2021-03-11 70079.0 51 0
1 490221211 431.0 2021-03-11 72219.0 55 1
2 490361213 562.0 2021-03-11 28559.0 24 0
3 490361232 562.0 2021-03-11 29102.0 29 1
4 490361221 562.0 2021-03-11 30183.0 33 2
Then, you can set the multi-index of df
to ['placa_encoded', 'n']
if you want:
df = df.set_index(['placa_encoded', 'index'])
print(df)
codcet date time_seconds velocidade
placa_encoded n
431.0 0 490191222 2021-03-11 70079.0 51
1 490221211 2021-03-11 72219.0 55
562.0 0 490361213 2021-03-11 28559.0 24
1 490361232 2021-03-11 29102.0 29
2 490361221 2021-03-11 30183.0 33