I have an event calendar dataframe as follows.
calendar = pd.DataFrame({"name": ["event1", "event2", "event3"],
"loc": ["USA", "JPN", "USA"],
"start_date": ["2022-01-10", "2022-01-11", "2022-04-04"],
"end_date": ["2022-01-13", "2022-01-14", "2022-04-06"]})
calendar["start_date"] = pd.to_datetime(calendar["start_date"])
calendar["end_date"] = pd.to_datetime(calendar["end_date"])
daily= pd.DataFrame({"date": ["2022-01-11", "2022-01-12", "2022-04-01", "2022-05-01"]})
daily["date"] = pd.to_datetime(daily["date"])
My calendar dataframe look like this.
name loc start_date end_date
event1 USA 2022-01-10 2022-01-13
event2 JPN 2022-01-11 2022-01-14
event3 USA 2022-04-04 2022-04-06
My daily dataframe look like this.
date
2022-01-11
2022-01-12
2022-04-01
2022-05-01
I would like to get the additional column on the daily
called trigger
where each row represents a dictionary of loc and name for my further processing (It is possible that one date could have more than one event). The expected output is the following.
date trigger
2022-01-11 {"USA": "event1", "JPN": "event2"}
2022-01-12 {"JPN": "event2"}
2022-04-01 {}
2022-05-01 {}
I can think about looping over each row in daily
but it may not efficient way to do. May I have your suggestions for the more elegant way?
CodePudding user response:
Try:
- create a
date_range
of start_date to end_date for each row explode
the dates to individual rowsgroupby
to create the required dictionary structure andreindex
to daily values.
calendar["dates"] = calendar.apply(lambda row: pd.date_range(row["start_date"],row["end_date"],freq="D"), axis=1)
calendar = calendar.explode("dates")
output = (calendar.groupby("dates")[["loc", "name"]]
.apply(lambda x: dict(zip(x["loc"],x["name"])))
.rename("trigger")
.reindex(daily["date"])
)
>>> output
date
2022-01-11 {'USA': 'event1', 'JPN': 'event2'}
2022-01-12 {'USA': 'event1', 'JPN': 'event2'}
2022-04-01 NaN
2022-05-01 NaN
Name: trigger, dtype: object
CodePudding user response:
IIUC, your output is incorrect.
Here is an approach by computing all date of an event and using explode
to enable the match:
(daily
.merge(calendar
.assign(date=calendar.apply(lambda r: pd.date_range(r['start_date'], r['end_date']), axis=1))
.explode('date').drop(columns=['start_date', 'end_date']),
on='date', how='left')
.groupby('date').apply(lambda d: float('nan') if d['name'].isna().all() else
d.set_index('loc')['name'].to_dict()
)
)
output:
date
2022-01-11 {'USA': 'event1', 'JPN': 'event2'}
2022-01-12 {'USA': 'event1', 'JPN': 'event2'}
2022-04-01 NaN
2022-05-01 NaN
CodePudding user response:
Another version using the cross product of daily
and calendar
:
out = daily.merge(calendar, how='cross').query('date.between(start_date, end_date)') \
.groupby('date')[['loc', 'name']].apply(lambda x: dict(x.values)) \
.reindex(daily['date']).rename('trigger').reset_index()
print(out)
# Output
date trigger
0 2022-01-11 {'USA': 'event1', 'JPN': 'event2'}
1 2022-01-12 {'USA': 'event1', 'JPN': 'event2'}
2 2022-04-01 NaN
3 2022-05-01 NaN