Bucket Customers based on Points-CodePudding

I have a customer table. I am trying to filter each ParentCustomerID based on multiple points they have and select a row based on the below conditions:

IF 0 points & negative points, select the row with the highest negative point (i.e. -30 > -20)
IF 0 points & positive points, select the row with the highest positive point
IF Positive & Negative Points, select the row with the highest positive point
IF Positive, 0 points, and Negative points, select the row with the highest positive point
IF 0 Points mark, select any row with 0 points
IF All Negative, select the row with the highest negative point (i.e. -30 > -20)

1:M relationship between ParentCustomerID and ChildCustomerID

ParentCustomerID	ChildCustomerID	Points
101	1	0.0
101	2	-20.0
101	3	-30.50
102	4	20.86
102	5	0.0
102	6	50.0
103	7	10.0
103	8	50.0
103	9	-30.0
104	10	-30.0
104	11	0.0
104	12	60.80
104	13	40.0
105	14	0.0
105	15	0.0
105	16	0.0
106	17	-20.0
106	18	-30.80
106	19	-40.20

Output should be:

ParentCustomerID	ChildCustomerID	Points
101	3	-30.50
102	6	50.0
103	8	50.0
104	12	60.80
105	16	0.0
106	19	-40.20

Note: for the rows customer 105, any row can be chosen because they all have 0 points.

Note2: Points can be float and ChildCustomerID can be missing (np.nan)

I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.

Thank you in advance!

CodePudding user response：

Code

df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)

(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)

How this works

Assign a temporary column named abs with the absolute values of Points
Assign a temporary column named pri(priority) corresponding to arithmetic signs(i.e, -1, 0, 1) of values in Points, Important hack: replace 0 with -2 so that zero always has least priority.
Sort the values by priority and absolute values
Drop the duplicates in sorted dataframe keeping the last row per ParentCustomerID

Result

    ParentCustomerID  ChildCustomerID  Points
2                101                3   -30.5
5                102                6    50.0
7                103                8    50.0
11               104               12    60.8
15               105               16     0.0
18               106               19   -40.2

CodePudding user response：

import pandas as pd
import numpy as np

df = pd.DataFrame([
    [101,   1,  0.0],
    [101,   2,  -20.0],
    [101,   3,  -30.50],
    [102,   4,  20.86],
    [102,   5,  0.0],
    [102,   6,  50.0],
    [103,   7,  10.0],
    [103,   8,  50.0],
    [103,   9,  -30.0],
    [104,   10,     -30.0],
    [104,   11,     0.0],
    [104,   12,     60.80],
    [104,   13,     40.0],
    [105,   14,     0.0],
    [105,   15,     0.0],
    [105,   16,     0.0],
    [106,   17,     -20.0],
    [106,   18,     -30.80],
    [106,   19,     -40.20]
],columns=['ParentCustomerID',  'ChildCustomerID',  'Points'])

data = df.groupby('ParentCustomerID').agg({
  'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
  'ChildCustomerID': list
})

pd.DataFrame(data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]], x["Points", "list"][x["Points", "<lambda_0>"]]), axis=1).tolist(), index=data.index).rename(columns={
    0: "ChildCustomerID",
    1: "Points"
}).reset_index()