I have DataFrame in Python Pandas like below (ID and COL1 is "object" data type):
ID | COL1 |
---|---|
123 | XX |
123 | XX |
123 | L |
456 | AP |
456 | AP |
77 | C |
88 | LACK |
66 | LACK |
66 | G |
And I need to create new column "COL1_bis" where will be infromation how many unique values in "COL1" has each ID. But if some ID has value "LACK" in "COL1" then we do not count that
So as a result I need something like below:
ID | COL1 | COL1_bis |
---|---|---|
123 | XX | 2 |
123 | XX | 2 |
123 | L | 2 |
456 | AP | 1 |
456 | AP | 1 |
77 | C | 1 |
88 | LACK | 0 |
66 | LACK | 1 |
66 | GG | 1 |
Because:
- ID = 123 has 2 in COL1_bis, because this ID has 2 different (unique) values in COL1: "XX" and "L"
- ID = 456 has 1 in COL1_bis, because this ID has 1 unique value in COL1: "AP"
- ID = 77 has 1 in COL1_bis, because this ID has 1 unique value in COL1: "C"
- ID = 88 has 0 in COL1_bis, because this ID has value "LACK" in COL1, so we do not count this value
- ID = 66 has 1 in COL1_bis, because this ID has "LACK" in COL1 and "GG" so we count only "GG"
How can I do that in Python Pandas?
CodePudding user response:
Use Series.mask
for replace LACK
to missing values and then use GroupBy.transform
with DataFrameGroupBy.nunique
:
df['COL1_bis'] = (df['COL1'].mask(df['COL1'].eq('LACK'))
.groupby(df['ID'])
.transform('nunique'))
print (df)
ID COL1 COL1_bis
0 123 XX 2
1 123 XX 2
2 123 L 2
3 456 AP 1
4 456 AP 1
5 77 C 1
6 88 LACK 0
7 66 LACK 1
8 66 G 1