I have Pandas DataFrame like below (data types of "ID" and "COL1" is "object"):
ID | COL1 | COL2 | COL3
----|------|------|----
123 | ABc | 55 | G4
123 | Abc | 55 | G4
123 | DD | 55 | G4
44 | RoR | 41 | P0
44 | RoR | 41 | P0
55 | XX | 456 | RR
And I need to:
- Create new column "COL1_cum" where will be all values from "COL1" per ID separated by commas
- Drop duplicated IDs
- Create new column "COL1_num" where will be information how many different levels is in "COL1" per "ID"
So as a result I need something like below:
ID | COL1_cum | COL1_num |COL2 | COL3
----|----------|----------|-----|-----
123 | ABc, DD | 2 | 55 | G4
44 | RoR | 1 | 41 | P0
55 | XX | 1 | 456 | RR
Explanation for COL1_num:
- for ID = 123 COL1_num = 2 because for ID = 123 in "COL1" we have 2 different values: "ABc" and "DD"
- for ID = 44 COL1_num = 1 because for ID = 44 in "COL1" we have 1 value: "RoR"
- for ID = 55 COL1_num = 1 because for ID = 5 in "COL1" we have 1 value: "XX"
How can I do that in Python Pandas?
CodePudding user response:
If there are 2 columns in input data use DataFrame.drop_duplicates
with aggregate join
:
df1 = df.drop_duplicates().groupby('ID')['COL1'].agg(','.join).reset_index(name='COL1_cum')
If possible multiple columns is possible specify them:
df1 = (df.drop_duplicates(['ID','COL1'])
.groupby('ID')['COL1']
.agg(','.join)
.reset_index(name='COL1_cum'))
EDIT:
First remove duplciates per all columns:
df1 = df.drop_duplicates()
print (df1)
ID COL1 COL2 COL3
0 123 ABc 55 G4
2 123 DD 55 G4
3 44 RoR 41 P0
5 55 XX 456 RR
Then aggregate join
, size
and get first values per another columns (because same values per groups ID
):
df2 = (df1.groupby('ID', sort=False, as_index=False)
.agg(COL1_cum =('COL1',','.join),
COL1_num=('COL1','size'),
COL2=('COL2','first'),
COL3=('COL3','first')))
print (df2)
ID COL1_cum COL1_num COL2 COL3
0 123 ABc,DD 2 55 G4
1 44 RoR 1 41 P0
2 55 XX 1 456 RR
EDIT2: Real data are not duplicated by all columns, possible solution:
df2 = (df.groupby('ID', sort=False, as_index=False)
.agg(COL1_cum =('COL1',lambda x: ','.join(dict.fromkeys(x))),
COL1_num=('COL1','nunique'),
COL2=('COL2','first'),
COL3=('COL3','first')))
print (df2)
ID COL1_cum COL1_num COL2 COL3
0 123 ABc,DD 2 55 G4
1 44 RoR 1 41 P0
2 55 XX 1 456 RR