Home > Software design >  Collapsing a multi-index df into a single row and column separated by a comma
Collapsing a multi-index df into a single row and column separated by a comma

Time:09-29

I have a df that looks like this:

df.set_index(['pid','vid]).sort_values(by = 'time')

Before:

                          pid    time 

    id           vid      
    
    id1         vis_id1    pid1    t_0      
                vis_id1    pid2    t_1
                vis_id1    pid1    t_2
                vis_id1    pid2    t_3
                vis_id1    pid1    t_4       
    
    id2         vis_id2    pid1    t_3      
                vis_id2    pid2    t_4      
                vis_id2    pid2    t_5      
                vis_id2    pid2    t_6
                vis_id2    pid2    t_7 

I want to collapse all of the pid such that for every id we have the following df

                                  pid       
id           vid      

id1         vis_id1    pid1, pid2, pid1, pid2, pid1               

id2         vis_id2    pid1, pid2, pid2, pid2, pid2      


  

I have tried transposing the data first and then applying [steps].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1) but if I can avoid the transposition all together that would save me a lot of computing power

CodePudding user response:

You can .groupby .agg() with aggregation with ', '.join, as follows:

df.groupby(['id', 'vid'])[['pid']].agg(', '.join)

Note the double square bracket [[...]] used around pid in [['pid']]

Result:

                                      pid
id  vid                                  
id1 vis_id1  pid1, pid2, pid1, pid2, pid1
id2 vis_id2  pid1, pid2, pid2, pid2, pid2

CodePudding user response:

Use pivot_table:

out = df.pivot_table(index=['id', 'vid'], values='pid', aggfunc=', '.join)
print(out)

# Output:
                                      pid
id  vid                                  
id1 vis_id1  pid1, pid2, pid1, pid2, pid1
id2 vis_id2  pid1, pid2, pid2, pid2, pid2
  • Related