I have a df of customers:
CUST_ID | SEGMENT | AREA
1       | B       | CAD
1       | A       | RAM
2       | B       | CAD
2       | C       | RAM
3       | B       | RAM
4       | A       | RAM
I want to count the unique number of CUST_ID per SEGMENT, so I did:
df.groupby(['SEGMENT'])['CUST_ID'].nunique()
However, if the same CUST_ID appears with different SEGMENT values, the count per SEGMENT gets inflated. I want to pick the highest-value SEGMENT per CUST_ID first and then count, with A being the highest and C the lowest. So the resulting df would look like:
CUST_ID | SEGMENT | AREA
1       | A       | RAM
2       | B       | CAD
3       | B       | RAM
4       | A       | RAM
and the count would be
- A - 2
- B - 2
- C - 0
How would I be able to do this?
CodePudding user response:
You can do it like this:
(df.sort_values('SEGMENT')         # 'A' sorts first, so the highest-value segment comes first
   .drop_duplicates('CUST_ID')     # keep only that first row per CUST_ID
   .groupby('SEGMENT')['CUST_ID']
   .nunique()                      # or just `.size()`, because there are no duplicates left
)
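For reference, a self-contained sketch of this approach on the question's sample data (nothing here beyond what the question gives, plus the standard pandas import):
import pandas as pd

df = pd.DataFrame({
    'CUST_ID': [1, 1, 2, 2, 3, 4],
    'SEGMENT': ['B', 'A', 'B', 'C', 'B', 'A'],
    'AREA':    ['CAD', 'RAM', 'CAD', 'RAM', 'RAM', 'RAM'],
})

counts = (df.sort_values('SEGMENT')       # highest-value segment ('A') sorts first
            .drop_duplicates('CUST_ID')   # one row per customer, keeping its best segment
            .groupby('SEGMENT')['CUST_ID']
            .size())
print(counts)   # A: 2, B: 2 -- C is absent; see the note at the end of the second answer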
CodePudding user response:
You can try grouping by the CUST_ID column, then filtering each group down to the rows with the min value of the SEGMENT column.
out = (df.groupby('CUST_ID')
         .apply(lambda g: g[g['SEGMENT'].eq(g['SEGMENT'].min())])  # keep each group's min-SEGMENT rows
         .reset_index(drop=True))
NOTE: Since you want to pick the highest-value SEGMENT per CUST_ID and then count, with A being the highest and C the lowest: in the ASCII table, A is 65 and C is 67, so when strings are compared, A is actually smaller than C. That's why min is used here.
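If relying on that ASCII coincidence feels fragile (e.g. the labels might later change to names that don't sort in value order), a sketch of an explicit alternative is to map each segment to a rank first. The rank dict below is an assumption covering only the three segments from the question:
rank = {'A': 0, 'B': 1, 'C': 2}   # assumed: lower rank = higher-value segment

# idxmin returns, per customer, the row label of the best-ranked row
best = df.loc[df['SEGMENT'].map(rank).groupby(df['CUST_ID']).idxmin()]
This selects the same rows as out above, without depending on alphabetical order.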
print(out)
   CUST_ID SEGMENT AREA
0        1       A  RAM
1        2       B  CAD
2        3       B  RAM
3        4       A  RAM
res = out.value_counts('SEGMENT')
print(res)
A    2
B    2
Name: SEGMENT, dtype: int64
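Note that segment C drops out of both answers' results, because no customer is left in it after deduplication, while the question's expected count includes C - 0. If you need those zero rows too, one option (a sketch, assuming the full segment list ['A', 'B', 'C'] is known up front) is to reindex the counts:
segments = ['A', 'B', 'C']   # assumed complete list of segments
res = out['SEGMENT'].value_counts().reindex(segments, fill_value=0)
print(res)   # A 2, B 2, C 0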