How to create a column with values from a column of duplicated rows, separated by commas, in a DataFrame in Python Pandas


I have a Pandas DataFrame like the one below (the data types of "ID" and "COL1" are "object"):

ID  | COL1 | COL2 | COL3
----|------|------|----
123 | ABc  | 55   | G4
123 | ABc  | 55   | G4
123 | DD   | 55   | G4
44  | RoR  | 41   | P0
44  | RoR  | 41   | P0
55  | XX   | 456  | RR
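
For reference, the sample frame can be rebuilt like this (a minimal sketch, with the values copied from the table above and "ID", "COL1" kept as strings):

import pandas as pd

# sample data from the table above; "ID" and "COL1" are object (string) dtype
df = pd.DataFrame({
    'ID':   ['123', '123', '123', '44', '44', '55'],
    'COL1': ['ABc', 'ABc', 'DD', 'RoR', 'RoR', 'XX'],
    'COL2': [55, 55, 55, 41, 41, 456],
    'COL3': ['G4', 'G4', 'G4', 'P0', 'P0', 'RR'],
})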

And I need to:

  1. Create a new column "COL1_cum" containing all values from "COL1" per "ID", separated by commas
  2. Drop duplicated IDs
  3. Create a new column "COL1_num" containing the number of different values in "COL1" per "ID"

So as a result I need something like below:

ID  | COL1_cum | COL1_num | COL2 | COL3
----|----------|----------|-----|-----
123 | ABc, DD  | 2        | 55  | G4
44  | RoR      | 1        | 41  | P0
55  | XX       | 1        | 456 | RR

Explanation for COL1_num:

  • for ID = 123 COL1_num = 2 because for ID = 123 in "COL1" we have 2 different values: "ABc" and "DD"
  • for ID = 44 COL1_num = 1 because for ID = 44 in "COL1" we have 1 value: "RoR"
  • for ID = 55 COL1_num = 1 because for ID = 55 in "COL1" we have 1 value: "XX"

How can I do that in Python Pandas?

CodePudding user response:

If there are only 2 columns in the input data, use DataFrame.drop_duplicates with an aggregate join:

df1 = df.drop_duplicates().groupby('ID')['COL1'].agg(','.join).reset_index(name='COL1_cum')
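
Note that this returns only "ID" and the joined "COL1_cum" column; the other columns are dropped. The EDIT versions below keep "COL2" and "COL3" by aggregating them with 'first'.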

If there can be multiple columns, specify the subset to deduplicate on:

df1 = (df.drop_duplicates(['ID','COL1'])
         .groupby('ID')['COL1']
         .agg(','.join)
         .reset_index(name='COL1_cum'))
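
drop_duplicates(['ID','COL1']) keeps only the first occurrence of each ("ID", "COL1") pair, so each value appears at most once in the joined string even when the other columns differ between rows.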

EDIT:

First remove duplicates across all columns:

df1 = df.drop_duplicates()
print (df1)
    ID COL1  COL2 COL3
0  123  ABc    55   G4
2  123   DD    55   G4
3   44  RoR    41   P0
5   55   XX   456   RR

Then aggregate: join "COL1", count the group size, and take the first value of the other columns (their values are identical within each "ID" group):

df2 = (df1.groupby('ID', sort=False, as_index=False)
          .agg(COL1_cum=('COL1', ','.join),
               COL1_num=('COL1', 'size'),
               COL2=('COL2', 'first'),
               COL3=('COL3', 'first')))
print (df2)
    ID COL1_cum  COL1_num  COL2 COL3
0  123   ABc,DD         2    55   G4
1   44      RoR         1    41   P0
2   55       XX         1   456   RR
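
Here 'size' counts the rows left in each group, which matches the number of distinct "COL1" values only because the exact duplicates were dropped first; if that cannot be guaranteed, 'nunique' (used in EDIT2 below) counts distinct values directly.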

EDIT2: If the real data are not duplicated across all columns, a possible solution:

df2 = (df.groupby('ID', sort=False, as_index=False)
         .agg(COL1_cum=('COL1', lambda x: ','.join(dict.fromkeys(x))),
              COL1_num=('COL1', 'nunique'),
              COL2=('COL2', 'first'),
              COL3=('COL3', 'first')))
print (df2)
    ID COL1_cum  COL1_num  COL2 COL3
0  123   ABc,DD         2    55   G4
1   44      RoR         1    41   P0
2   55       XX         1   456   RR
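
The expected output in the question joins with a comma and a space ("ABc, DD"), while the solutions above join with a bare comma. If that exact formatting matters, the same aggregation works with ', '.join instead; a small variation (a sketch, reusing df from above):

df2 = (df.groupby('ID', sort=False, as_index=False)
         .agg(COL1_cum=('COL1', lambda x: ', '.join(dict.fromkeys(x))),
              COL1_num=('COL1', 'nunique'),
              COL2=('COL2', 'first'),
              COL3=('COL3', 'first')))
# COL1_cum for ID 123 is now 'ABc, DD'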