Home > Net >  Bucket Customers based on Points
Bucket Customers based on Points

Time:07-11

I have a customer table. I am trying to filter each ParentCustomerID based on multiple points they have and select a row based on the below conditions:

  • IF 0 points & negative points, select the row with the highest negative point (i.e. -30 > -20)

  • IF 0 points & positive points, select the row with the highest positive point

  • IF Positive & Negative Points, select the row with the highest positive point

  • IF Positive, 0 points, and Negative points, select the row with the highest positive point

  • IF 0 Points mark, select any row with 0 points

  • IF All Negative, select the row with the highest negative point (i.e. -30 > -20)

1:M relationship between ParentCustomerID and ChildCustomerID

ParentCustomerID ChildCustomerID Points
101 1 0.0
101 2 -20.0
101 3 -30.50
102 4 20.86
102 5 0.0
102 6 50.0
103 7 10.0
103 8 50.0
103 9 -30.0
104 10 -30.0
104 11 0.0
104 12 60.80
104 13 40.0
105 14 0.0
105 15 0.0
105 16 0.0
106 17 -20.0
106 18 -30.80
106 19 -40.20

Output should be:

ParentCustomerID ChildCustomerID Points
101 3 -30.50
102 6 50.0
103 8 50.0
104 12 60.80
105 16 0.0
106 19 -40.20

Note: for the rows customer 105, any row can be chosen because they all have 0 points.

Note2: Points can be float and ChildCustomerID can be missing (np.nan)

I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.

Thank you in advance!

CodePudding user response:

Code

df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)

(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)

How this works

  • Assign a temporary column named abs with the absolute values of Points
  • Assign a temporary column named pri(priority) corresponding to arithmetic signs(i.e, -1, 0, 1) of values in Points, Important hack: replace 0 with -2 so that zero always has least priority.
  • Sort the values by priority and absolute values
  • Drop the duplicates in sorted dataframe keeping the last row per ParentCustomerID

Result

    ParentCustomerID  ChildCustomerID  Points
2                101                3   -30.5
5                102                6    50.0
7                103                8    50.0
11               104               12    60.8
15               105               16     0.0
18               106               19   -40.2

CodePudding user response:

import pandas as pd
import numpy as np

df = pd.DataFrame([
    [101,   1,  0.0],
    [101,   2,  -20.0],
    [101,   3,  -30.50],
    [102,   4,  20.86],
    [102,   5,  0.0],
    [102,   6,  50.0],
    [103,   7,  10.0],
    [103,   8,  50.0],
    [103,   9,  -30.0],
    [104,   10,     -30.0],
    [104,   11,     0.0],
    [104,   12,     60.80],
    [104,   13,     40.0],
    [105,   14,     0.0],
    [105,   15,     0.0],
    [105,   16,     0.0],
    [106,   17,     -20.0],
    [106,   18,     -30.80],
    [106,   19,     -40.20]
],columns=['ParentCustomerID',  'ChildCustomerID',  'Points'])

data = df.groupby('ParentCustomerID').agg({
  'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
  'ChildCustomerID': list
})

pd.DataFrame(data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]], x["Points", "list"][x["Points", "<lambda_0>"]]), axis=1).tolist(), index=data.index).rename(columns={
    0: "ChildCustomerID",
    1: "Points"
}).reset_index()
  • Related