According to this post, it looks similar to my problem but I tried every solutions in that post and it cannot solve my problem due to different data type (e.g. earlier post has lower bound and upper bound as date but mine are float) and the way to merge (the earlier post use outer merge to get exacy match while my problem need merge_asof with backward direction). Let me explain my problem again. I have a sample dataframe as follows
df = pd.DataFrame({"date": ["2022-04-10", "2022-04-11", "2022-04-12", "2022-04-13"],
"occupancy": [30, 55, 28, 100]})
df["date"] = pd.to_datetime(df["date"])
so my dataframe look like this
index date occupancy
0 2022-04-10 30
1 2022-04-11 55
2 2022-04-12 28
3 2022-04-13 100
There is another dataframe that represents predefined rules as follows.
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 25],
"upper": [25, 50, 75, 100, 50],
"action": [-15, -5, 20, 25, -10],
"filedate": ["2022-04-01", "2022-04-01", "2022-04-01", "2022-04-01", "2022-04-12"]})
rules["filedate"] = pd.to_datetime(rules["filedate"])
The rules look like this.
index lower upper action filedate
0 0 25 -15 2022-04-01
1 25 50 -5 2022-04-01
2 50 75 20 2022-04-01
3 75 100 25 2022-04-01
4 25 50 -10 2022-04-12
I would like to take the action
column and add it to the df
. Merging conditions are, lower <= occupancy < upper
and filedate <= date
but the rule is selected based on the nearest filedate that is less than date. The expected output should be the following.
index date occupancy action
0 2022-04-10 30 -5
1 2022-04-11 55 20
2 2022-04-12 28 -10
3 2022-04-13 100 NaN
Explanations of the above output are the following. For example, row index 0 of df
, its occupancy
lies between lower
(25) and upper
(50). There are 2 rows in the rules
those satisfy such a condition (row index 1 and index row 4). However, action of row index 1 will be selected because the filedate 2022-04-01
is less than the date 2022-04-10
according to the condition filedate <= date
. We can think that the filedate
is an effective date to execute action.
Another example, row index 2 of df
, its occupancy
lies between lower
(25) and upper
(50). There are 2 rows in the rules
those satisfy such a condition (row index 1 and index row 4) but the action of row index 4 will be selected because the filedate 2022-04-12
is less than or equal to the date 2022-04-12
according to the condition filedate <= date
but the rule is selected based on the nearest filedate that is less than date.
I think my problem need to use merge_asof
but I could not figure out how to effectively implement it. May I have your suggestions?
CodePudding user response:
Ok. I can solve my problem now. It is not an efficient way but at least the problem solved. If there is the better way in term of efficiency, please advise.
df = pd.DataFrame({"date": ["2022-04-10", "2022-04-11", "2022-04-12", "2022-04-13"],
"occupancy": [30, 55, 28, 100]})
df["date"] = pd.to_datetime(df["date"])
rules = pd.DataFrame({"lower": [0, 25, 50, 75, 25],
"upper": [25, 50, 75, 100, 50],
"action": [-15, -5, 20, 25, -10],
"filedate": ["2022-04-01", "2022-04-01", "2022-04-01", "2022-04-01", "2022-04-12"]})
rules["filedate"] = pd.to_datetime(rules["filedate"])
actions = []
for i in df.index:
date = df.loc[i, "date"]
occ = df.loc[i, "occupancy"]
x = pd.DataFrame(df.loc[i, :]).T
subrules = rules[rules["filedate"] <= date]
mask = np.logical_and(subrules["lower"] <= occ, subrules["upper"] > occ)
temp = subrules[mask.values]
if temp.empty:
actions.append(np.nan)
else:
actions.append(pd.merge_asof(x, temp, left_on="date", right_on="filedate", direction="backward")["action"].values[0])
df["action"] = actions
CodePudding user response:
You could write a custom function to do the filtering and apply
it to each row of your DataFrame:
def custom_merge(row):
sample = rules[rules["lower"].le(row.at["occupancy"])&rules["upper"].gt(row.at["occupancy"])&rules["filedate"].le(row.at["date"])]
if sample.shape[0] > 0:
action = sample.loc[sample["filedate"].sub(row["date"]).abs().idxmin()]["action"]
else:
action = np.nan
return action
df["action"] = df.apply(custom_merge, axis=1)
>>> df
date occupancy action
0 2022-04-10 30 -5.0
1 2022-04-11 55 20.0
2 2022-04-12 28 -10.0
3 2022-04-13 100 NaN