I have a df of customers:
CUST_ID | SEGMENT | AREA
1       | B       | CAD
1       | A       | RAM
2       | B       | CAD
2       | C       | RAM
3       | B       | RAM
4       | A       | RAM
I want to count the unique number of CUST_ID per SEGMENT, so I did:
df.groupby(['SEGMENT'])['CUST_ID'].nunique()
However, if the same CUST_ID appears with different SEGMENT values, the count per SEGMENT gets inflated. I want to pick the highest-value SEGMENT per CUST_ID first and then count, with A being the highest and C the lowest. So the resulting df would look like:
CUST_ID | SEGMENT | AREA
1       | A       | RAM
2       | B       | CAD
3       | B       | RAM
4       | A       | RAM
and the count would be
- A - 2
- B - 2
- C - 0
How would I be able to do this?
CodePudding user response:
You can do it like this:
(df.sort_values('SEGMENT')         # 'A' sorts first, so the highest-value segment comes first
   .drop_duplicates('CUST_ID')     # keep only that first row per CUST_ID
   .groupby('SEGMENT')['CUST_ID']
   .nunique()                      # or just `.size()`, because there are no duplicates left
)
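For reference, a self-contained sketch of this approach on the question's sample data (nothing here beyond what the question gives, plus the standard pandas import):
import pandas as pd

df = pd.DataFrame({
    'CUST_ID': [1, 1, 2, 2, 3, 4],
    'SEGMENT': ['B', 'A', 'B', 'C', 'B', 'A'],
    'AREA':    ['CAD', 'RAM', 'CAD', 'RAM', 'RAM', 'RAM'],
})

counts = (df.sort_values('SEGMENT')       # highest-value segment ('A') sorts first
            .drop_duplicates('CUST_ID')   # one row per customer, keeping its best segment
            .groupby('SEGMENT')['CUST_ID']
            .size())
print(counts)   # A: 2, B: 2 -- C is absent; see the note at the end of the second answer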
CodePudding user response:
You can try grouping by the CUST_ID column, then filtering each group down to the rows with the min value of the SEGMENT column.
out = (df.groupby('CUST_ID')
         .apply(lambda g: g[g['SEGMENT'].eq(g['SEGMENT'].min())])  # keep each group's min-SEGMENT rows
         .reset_index(drop=True))
NOTE: Since you want to pick the highest-value SEGMENT per CUST_ID and then count, with A being the highest and C the lowest: in the ASCII table, A is 65 and C is 67, so when strings are compared, A is actually smaller than C. That's why min is used here.
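If relying on that ASCII coincidence feels fragile (e.g. the labels might later change to names that don't sort in value order), a sketch of an explicit alternative is to map each segment to a rank first. The rank dict below is an assumption covering only the three segments from the question:
rank = {'A': 0, 'B': 1, 'C': 2}   # assumed: lower rank = higher-value segment

# idxmin returns, per customer, the row label of the best-ranked row
best = df.loc[df['SEGMENT'].map(rank).groupby(df['CUST_ID']).idxmin()]
This selects the same rows as out above, without depending on alphabetical order.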
print(out)
   CUST_ID SEGMENT AREA
0        1       A  RAM
1        2       B  CAD
2        3       B  RAM
3        4       A  RAM
res = out.value_counts('SEGMENT')
print(res)
A    2
B    2
Name: SEGMENT, dtype: int64
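Note that segment C drops out of both answers' results, because no customer is left in it after deduplication, while the question's expected count includes C - 0. If you need those zero rows too, one option (a sketch, assuming the full segment list ['A', 'B', 'C'] is known up front) is to reindex the counts:
segments = ['A', 'B', 'C']   # assumed complete list of segments
res = out['SEGMENT'].value_counts().reindex(segments, fill_value=0)
print(res)   # A 2, B 2, C 0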