I have a customer table. I am trying to filter each ParentCustomerID based on multiple points they have and select a row based on the below conditions:
IF 0 points & negative points, select the row with the highest negative point (i.e. -30 > -20)
IF 0 points & positive points, select the row with the highest positive point
IF Positive & Negative Points, select the row with the highest positive point
IF Positive, 0 points, and Negative points, select the row with the highest positive point
IF 0 Points mark, select any row with 0 points
IF All Negative, select the row with the highest negative point (i.e. -30 > -20)
1:M relationship between ParentCustomerID and ChildCustomerID
ParentCustomerID | ChildCustomerID | Points |
---|---|---|
101 | 1 | 0.0 |
101 | 2 | -20.0 |
101 | 3 | -30.50 |
102 | 4 | 20.86 |
102 | 5 | 0.0 |
102 | 6 | 50.0 |
103 | 7 | 10.0 |
103 | 8 | 50.0 |
103 | 9 | -30.0 |
104 | 10 | -30.0 |
104 | 11 | 0.0 |
104 | 12 | 60.80 |
104 | 13 | 40.0 |
105 | 14 | 0.0 |
105 | 15 | 0.0 |
105 | 16 | 0.0 |
106 | 17 | -20.0 |
106 | 18 | -30.80 |
106 | 19 | -40.20 |
Output should be:
ParentCustomerID | ChildCustomerID | Points |
---|---|---|
101 | 3 | -30.50 |
102 | 6 | 50.0 |
103 | 8 | 50.0 |
104 | 12 | 60.80 |
105 | 16 | 0.0 |
106 | 19 | -40.20 |
Note: for the rows customer 105, any row can be chosen because they all have 0 points.
Note2: Points can be float and ChildCustomerID can be missing (np.nan)
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!
CodePudding user response:
Code
df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)
(
df.sort_values(['pri', 'abs'])
.drop_duplicates('ParentCustomerID', keep='last')
.drop(['pri', 'abs'], axis=1)
.sort_index()
)
How this works
- Assign a temporary column named
abs
with the absolute values ofPoints
- Assign a temporary column named
pri
(priority) corresponding to arithmetic signs(i.e, -1, 0, 1) of values inPoints
, Important hack: replace0
with-2
so that zero always has least priority. - Sort the values by priority and absolute values
- Drop the duplicates in sorted dataframe keeping the last row per
ParentCustomerID
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
CodePudding user response:
import pandas as pd
import numpy as np
df = pd.DataFrame([
[101, 1, 0.0],
[101, 2, -20.0],
[101, 3, -30.50],
[102, 4, 20.86],
[102, 5, 0.0],
[102, 6, 50.0],
[103, 7, 10.0],
[103, 8, 50.0],
[103, 9, -30.0],
[104, 10, -30.0],
[104, 11, 0.0],
[104, 12, 60.80],
[104, 13, 40.0],
[105, 14, 0.0],
[105, 15, 0.0],
[105, 16, 0.0],
[106, 17, -20.0],
[106, 18, -30.80],
[106, 19, -40.20]
],columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])
data = df.groupby('ParentCustomerID').agg({
'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
'ChildCustomerID': list
})
pd.DataFrame(data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]], x["Points", "list"][x["Points", "<lambda_0>"]]), axis=1).tolist(), index=data.index).rename(columns={
0: "ChildCustomerID",
1: "Points"
}).reset_index()