I have a df that looks like this:
df.set_index(['pid','vid]).sort_values(by = 'time')
Before:
pid time
id vid
id1 vis_id1 pid1 t_0
vis_id1 pid2 t_1
vis_id1 pid1 t_2
vis_id1 pid2 t_3
vis_id1 pid1 t_4
id2 vis_id2 pid1 t_3
vis_id2 pid2 t_4
vis_id2 pid2 t_5
vis_id2 pid2 t_6
vis_id2 pid2 t_7
I want to collapse all of the pid
such that for every id
we have the following df
pid
id vid
id1 vis_id1 pid1, pid2, pid1, pid2, pid1
id2 vis_id2 pid1, pid2, pid2, pid2, pid2
I have tried transposing the data first and then applying [steps].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
but if I can avoid the transposition all together that would save me a lot of computing power
CodePudding user response:
You can .groupby
.agg()
with aggregation with ', '.join
, as follows:
df.groupby(['id', 'vid'])[['pid']].agg(', '.join)
Note the double square bracket [[...]]
used around pid
in [['pid']]
Result:
pid
id vid
id1 vis_id1 pid1, pid2, pid1, pid2, pid1
id2 vis_id2 pid1, pid2, pid2, pid2, pid2
CodePudding user response:
Use pivot_table
:
out = df.pivot_table(index=['id', 'vid'], values='pid', aggfunc=', '.join)
print(out)
# Output:
pid
id vid
id1 vis_id1 pid1, pid2, pid1, pid2, pid1
id2 vis_id2 pid1, pid2, pid2, pid2, pid2