Home > Mobile >  How to organize fields in dataframe by repetition and drop duplicates
How to organize fields in dataframe by repetition and drop duplicates

Time:10-23

I have this

id phone1 phone2 
1  300    301
1  303    300
1  300    303
2  400    401

Want this

id phone1 phone2 phone3
1  300    303    301
2  400    401

I have tried group by id and column phone1, apply count function, iterate over it adding to a list verifying if is already there the id and phone and sum the third column, and do the same thing with phone2 in the same list

After it reorganize the dataframe iterating the list but this is so slow with the millions of data that i have to proccess

dataframe1 = dataframe.groupby(['id', 'phone1']).count().reset_index()
dataframe2 = dataframe.groupby(['id', 'phone2']).count().reset_index()

result to add in a list

id phone1 phone2
1  300    2    
1  303    1
2  401    1

id phone1 phone2
1  300    1   
1  301    1
1  303    1
2  400    1

CodePudding user response:

Iterate a dataframe is so slow and not recomended

You can group the phones and apply the list function for each id, after it organize by duplicates and split in new columns

CodePudding user response:

You can melt to reshape the phone columns to rows, then remove the duplicates per group. Finally, pivot to reshape back to wide format.

out = (df
   .melt('id')
   .drop_duplicates(['id', 'value'])
   .assign(col=lambda d: d.groupby('id').cumcount().add(1))
   .pivot_table(index='id', columns='col', values='value', fill_value=pd.NA)
   .astype('Int64') # optional
   .add_prefix('phone')
   .rename_axis(columns=None).reset_index()
)

output:

   id  phone1  phone2  phone3
0   1     300     303     301
1   2     400     401    <NA>
  • Related