Home > front end >  Merge 2 pandas dataframes conditional by upper bound (float) and lower bound (float) and merge_asof
Merge 2 pandas dataframes conditional by upper bound (float) and lower bound (float) and merge_asof

Time:03-10

According to this post, it looks similar to my problem but I tried every solutions in that post and it cannot solve my problem due to different data type (e.g. earlier post has lower bound and upper bound as date but mine are float) and the way to merge (the earlier post use outer merge to get exacy match while my problem need merge_asof with backward direction). Let me explain my problem again. I have a sample dataframe as follows

df = pd.DataFrame({"date": ["2022-04-10", "2022-04-11", "2022-04-12", "2022-04-13"],
                   "occupancy": [30, 55, 28, 100]})
df["date"] = pd.to_datetime(df["date"])

so my dataframe look like this

index   date           occupancy
    0   2022-04-10      30
    1   2022-04-11      55
    2   2022-04-12      28
    3   2022-04-13     100

There is another dataframe that represents predefined rules as follows.

rules = pd.DataFrame({"lower": [0, 25, 50, 75, 25],
                      "upper": [25, 50, 75, 100, 50],
                      "action": [-15, -5, 20, 25, -10],
                      "filedate": ["2022-04-01", "2022-04-01", "2022-04-01", "2022-04-01", "2022-04-12"]})
rules["filedate"] = pd.to_datetime(rules["filedate"])

The rules look like this.

index   lower       upper      action         filedate
    0       0          25         -15         2022-04-01
    1      25          50          -5         2022-04-01
    2      50          75          20         2022-04-01
    3      75         100          25         2022-04-01
    4      25          50         -10         2022-04-12

I would like to take the action column and add it to the df. Merging conditions are, lower <= occupancy < upper and filedate <= date but the rule is selected based on the nearest filedate that is less than date. The expected output should be the following.

index   date           occupancy  action
    0   2022-04-10      30            -5
    1   2022-04-11      55            20
    2   2022-04-12      28           -10
    3   2022-04-13     100           NaN

Explanations of the above output are the following. For example, row index 0 of df, its occupancy lies between lower (25) and upper (50). There are 2 rows in the rules those satisfy such a condition (row index 1 and index row 4). However, action of row index 1 will be selected because the filedate 2022-04-01 is less than the date 2022-04-10 according to the condition filedate <= date. We can think that the filedate is an effective date to execute action.

Another example, row index 2 of df, its occupancy lies between lower (25) and upper (50). There are 2 rows in the rules those satisfy such a condition (row index 1 and index row 4) but the action of row index 4 will be selected because the filedate 2022-04-12 is less than or equal to the date 2022-04-12 according to the condition filedate <= date but the rule is selected based on the nearest filedate that is less than date.

I think my problem need to use merge_asof but I could not figure out how to effectively implement it. May I have your suggestions?

CodePudding user response:

Ok. I can solve my problem now. It is not an efficient way but at least the problem solved. If there is the better way in term of efficiency, please advise.

df = pd.DataFrame({"date": ["2022-04-10", "2022-04-11", "2022-04-12", "2022-04-13"],
                       "occupancy": [30, 55, 28, 100]})
    df["date"] = pd.to_datetime(df["date"])
    
    rules = pd.DataFrame({"lower": [0, 25, 50, 75, 25],
                          "upper": [25, 50, 75, 100, 50],
                          "action": [-15, -5, 20, 25, -10],
                          "filedate": ["2022-04-01", "2022-04-01", "2022-04-01", "2022-04-01", "2022-04-12"]})
    rules["filedate"] = pd.to_datetime(rules["filedate"])
    
    
    
    actions = []
    for i in df.index:
        date = df.loc[i, "date"]
        occ = df.loc[i, "occupancy"]
        x = pd.DataFrame(df.loc[i, :]).T
        subrules = rules[rules["filedate"] <= date]
        mask = np.logical_and(subrules["lower"] <= occ, subrules["upper"] > occ)
        temp = subrules[mask.values]
        if temp.empty:
            actions.append(np.nan)
        else:
            actions.append(pd.merge_asof(x, temp, left_on="date", right_on="filedate", direction="backward")["action"].values[0])
    df["action"] = actions

CodePudding user response:

You could write a custom function to do the filtering and apply it to each row of your DataFrame:

def custom_merge(row):
    sample = rules[rules["lower"].le(row.at["occupancy"])&rules["upper"].gt(row.at["occupancy"])&rules["filedate"].le(row.at["date"])]
    if sample.shape[0] > 0:
        action = sample.loc[sample["filedate"].sub(row["date"]).abs().idxmin()]["action"]
    else:
        action = np.nan
    return action

df["action"] = df.apply(custom_merge, axis=1)

>>> df
        date  occupancy  action
0 2022-04-10         30    -5.0
1 2022-04-11         55    20.0
2 2022-04-12         28   -10.0
3 2022-04-13        100     NaN
  • Related