Pandas - Best way to derive a new column by checking condition on other rows of the same dataframe-CodePudding

Here is my data,

df = pd.DataFrame()
df["GroupType"] = ["Group_1","Group_2","Group_3","Group_1","Group_2","Group_3",
                   "Group_1","Group_2","Group_3","Group_1","Group_4","Group_4","Group_4"]
df["ID_1"] = ["10","10","10",
              "10","10","10",
              "10","10", "10",
              "12","12","12","13"]
df["ID_2"] = [pd.NA,"100",pd.NA,
              pd.NA,"200",pd.NA,
              pd.NA,"300",pd.NA,
              "400","400","400",pd.NA]
df["Price"] = [1,2,3,4,5,6,7,8,9,10,11,12,13]
df
   GroupType ID_1  ID_2  Price
0    Group_1   10  <NA>      1
1    Group_2   10   100      2
2    Group_3   10  <NA>      3
3    Group_1   10  <NA>      4
4    Group_2   10   200      5
5    Group_3   10  <NA>      6
6    Group_1   10  <NA>      7
7    Group_2   10   300      8
8    Group_3   10  <NA>      9
9    Group_1   12   400     10
10   Group_4   12   400     11
11   Group_4   12   400     12
12   Group_4   13  <NA>     13

The requirement is to derive the 'Actual Price'. Logic is if 'GroupType' == 'Group_4' then get 'Price' from the matching Group_1 with same 'ID_2'. For example row 9 above is the matching 'Group_1' record for rows 10,and 11. And if 'GroupType' != 'Group_4' then populate just 'Price'.

Expected Dataframe:

   GroupType ID_1  ID_2  Price  Actual_Price
0    Group_1   10  <NA>      1             1
1    Group_2   10   100      2             2
2    Group_3   10  <NA>      3             3
3    Group_1   10  <NA>      4             4
4    Group_2   10   200      5             5
5    Group_3   10  <NA>      6             6
6    Group_1   10  <NA>      7             7
7    Group_2   10   300      8             8
8    Group_3   10  <NA>      9             9
9    Group_1   12   400     10            10
10   Group_4   12   400     11            10
11   Group_4   12   400     12            10
12   Group_4   13  <NA>     13            pd.NA

My solution:

def get_linked_actual_price(
        x: pd.Series, source_df: pd.DataFrame
):
    mask = (source_df["ID_2"]== x["ID_2"]) & (source_df["GroupType"]== "Group_1")
    if not mask.any():
        return pd.NA
    source_df = source_df.loc[mask]
    if len(source_df.index) > 1:
        source_df = source_df[:1]
    result_series = source_df.squeeze()
    return result_series.get("Price", pd.NA)

group_4_mask = df["GroupType"] == "Group_4"
df.loc[~group_4_mask,"Actual Price"] = df.loc[~group_4_mask, "Price"]
df.loc[group_4_mask, "Actual_Price"] = df.loc[group_4_mask].apply(get_linked_actual_price, axis=1, source_df=df)
print(df)

My concern is since I am using the apply on each row this is not performatic. So wanted to check if there is a better solution in terms of performance?

CodePudding user response：

Here is how I would approach this problem

Mask the values in Price where GroupType is not equal to Group_1 then group the masked column by ID_2 and transform using first
Update the values in Price column where GroupType is Group_4 using the values obtained from step 1

m = df['GroupType'].eq('Group_1')
s = df['Price'].where(m).groupby(df['ID_2']).transform('first')
df['Price'] = df['Price'].mask(df['GroupType'].eq('Group_4'), s)

   GroupType ID_1  ID_2  Price  Actual_price
0    Group_1   10  <NA>      1           1.0
1    Group_2   10   100      2           2.0
2    Group_3   10  <NA>      3           3.0
3    Group_1   10  <NA>      4           4.0
4    Group_2   10   200      5           5.0
5    Group_3   10  <NA>      6           6.0
6    Group_1   10  <NA>      7           7.0
7    Group_2   10   300      8           8.0
8    Group_3   10  <NA>      9           9.0
9    Group_1   12   400     10          10.0
10   Group_4   12   400     11          10.0
11   Group_4   12   400     12          10.0
12   Group_4   13  <NA>     13           NaN