Here is my data,
df = pd.DataFrame()
df["GroupType"] = ["Group_1","Group_2","Group_3","Group_1","Group_2","Group_3",
"Group_1","Group_2","Group_3","Group_1","Group_4","Group_4","Group_4"]
df["ID_1"] = ["10","10","10",
"10","10","10",
"10","10", "10",
"12","12","12","13"]
df["ID_2"] = [pd.NA,"100",pd.NA,
pd.NA,"200",pd.NA,
pd.NA,"300",pd.NA,
"400","400","400",pd.NA]
df["Price"] = [1,2,3,4,5,6,7,8,9,10,11,12,13]
df
GroupType ID_1 ID_2 Price
0 Group_1 10 <NA> 1
1 Group_2 10 100 2
2 Group_3 10 <NA> 3
3 Group_1 10 <NA> 4
4 Group_2 10 200 5
5 Group_3 10 <NA> 6
6 Group_1 10 <NA> 7
7 Group_2 10 300 8
8 Group_3 10 <NA> 9
9 Group_1 12 400 10
10 Group_4 12 400 11
11 Group_4 12 400 12
12 Group_4 13 <NA> 13
The requirement is to derive the 'Actual Price'. Logic is if 'GroupType' == 'Group_4' then get 'Price' from the matching Group_1 with same 'ID_2'. For example row 9 above is the matching 'Group_1' record for rows 10,and 11. And if 'GroupType' != 'Group_4' then populate just 'Price'.
Expected Dataframe:
GroupType ID_1 ID_2 Price Actual_Price
0 Group_1 10 <NA> 1 1
1 Group_2 10 100 2 2
2 Group_3 10 <NA> 3 3
3 Group_1 10 <NA> 4 4
4 Group_2 10 200 5 5
5 Group_3 10 <NA> 6 6
6 Group_1 10 <NA> 7 7
7 Group_2 10 300 8 8
8 Group_3 10 <NA> 9 9
9 Group_1 12 400 10 10
10 Group_4 12 400 11 10
11 Group_4 12 400 12 10
12 Group_4 13 <NA> 13 pd.NA
My solution:
def get_linked_actual_price(
x: pd.Series, source_df: pd.DataFrame
):
mask = (source_df["ID_2"]== x["ID_2"]) & (source_df["GroupType"]== "Group_1")
if not mask.any():
return pd.NA
source_df = source_df.loc[mask]
if len(source_df.index) > 1:
source_df = source_df[:1]
result_series = source_df.squeeze()
return result_series.get("Price", pd.NA)
group_4_mask = df["GroupType"] == "Group_4"
df.loc[~group_4_mask,"Actual Price"] = df.loc[~group_4_mask, "Price"]
df.loc[group_4_mask, "Actual_Price"] = df.loc[group_4_mask].apply(get_linked_actual_price, axis=1, source_df=df)
print(df)
My concern is since I am using the apply on each row this is not performatic. So wanted to check if there is a better solution in terms of performance?
CodePudding user response:
Here is how I would approach this problem
- Mask the values in
Price
whereGroupType
is not equal toGroup_1
then group the masked column byID_2
andtransform
usingfirst
- Update the values in
Price
column whereGroupType
isGroup_4
using the values obtained fromstep 1
m = df['GroupType'].eq('Group_1')
s = df['Price'].where(m).groupby(df['ID_2']).transform('first')
df['Price'] = df['Price'].mask(df['GroupType'].eq('Group_4'), s)
GroupType ID_1 ID_2 Price Actual_price
0 Group_1 10 <NA> 1 1.0
1 Group_2 10 100 2 2.0
2 Group_3 10 <NA> 3 3.0
3 Group_1 10 <NA> 4 4.0
4 Group_2 10 200 5 5.0
5 Group_3 10 <NA> 6 6.0
6 Group_1 10 <NA> 7 7.0
7 Group_2 10 300 8 8.0
8 Group_3 10 <NA> 9 9.0
9 Group_1 12 400 10 10.0
10 Group_4 12 400 11 10.0
11 Group_4 12 400 12 10.0
12 Group_4 13 <NA> 13 NaN