Home > Software design >  Detecting rows from 2 columns of dates and return as dictionary
Detecting rows from 2 columns of dates and return as dictionary

Time:02-23

I have an event calendar dataframe as follows.

calendar = pd.DataFrame({"name": ["event1", "event2", "event3"], 
                         "loc": ["USA", "JPN", "USA"],
                         "start_date": ["2022-01-10", "2022-01-11", "2022-04-04"],
                         "end_date": ["2022-01-13", "2022-01-14", "2022-04-06"]})
calendar["start_date"] = pd.to_datetime(calendar["start_date"])
calendar["end_date"] = pd.to_datetime(calendar["end_date"])

daily= pd.DataFrame({"date": ["2022-01-11", "2022-01-12", "2022-04-01", "2022-05-01"]})
daily["date"] = pd.to_datetime(daily["date"])

My calendar dataframe look like this.

name     loc     start_date    end_date
event1   USA     2022-01-10    2022-01-13
event2   JPN     2022-01-11    2022-01-14
event3   USA     2022-04-04    2022-04-06

My daily dataframe look like this.

date
2022-01-11
2022-01-12
2022-04-01
2022-05-01

I would like to get the additional column on the daily called trigger where each row represents a dictionary of loc and name for my further processing (It is possible that one date could have more than one event). The expected output is the following.

date          trigger
2022-01-11    {"USA": "event1", "JPN": "event2"}
2022-01-12    {"JPN": "event2"}
2022-04-01    {}
2022-05-01    {}

I can think about looping over each row in daily but it may not efficient way to do. May I have your suggestions for the more elegant way?

CodePudding user response:

Try:

  1. create a date_range of start_date to end_date for each row
  2. explode the dates to individual rows
  3. groupby to create the required dictionary structure and reindex to daily values.
calendar["dates"] = calendar.apply(lambda row: pd.date_range(row["start_date"],row["end_date"],freq="D"), axis=1)
calendar = calendar.explode("dates")
output = (calendar.groupby("dates")[["loc", "name"]]
                  .apply(lambda x: dict(zip(x["loc"],x["name"])))
                  .rename("trigger")
                  .reindex(daily["date"])
          )

>>> output
date
2022-01-11    {'USA': 'event1', 'JPN': 'event2'}
2022-01-12    {'USA': 'event1', 'JPN': 'event2'}
2022-04-01                                   NaN
2022-05-01                                   NaN
Name: trigger, dtype: object

CodePudding user response:

IIUC, your output is incorrect.

Here is an approach by computing all date of an event and using explode to enable the match:

(daily
 .merge(calendar
 .assign(date=calendar.apply(lambda r: pd.date_range(r['start_date'], r['end_date']), axis=1))
 .explode('date').drop(columns=['start_date', 'end_date']),
        on='date', how='left')
 .groupby('date').apply(lambda d: float('nan') if d['name'].isna().all() else
                                  d.set_index('loc')['name'].to_dict()
                       )
)

output:

date
2022-01-11    {'USA': 'event1', 'JPN': 'event2'}
2022-01-12    {'USA': 'event1', 'JPN': 'event2'}
2022-04-01                                   NaN
2022-05-01                                   NaN

CodePudding user response:

Another version using the cross product of daily and calendar:

out = daily.merge(calendar, how='cross').query('date.between(start_date, end_date)') \
           .groupby('date')[['loc', 'name']].apply(lambda x: dict(x.values)) \
           .reindex(daily['date']).rename('trigger').reset_index()
print(out)

# Output
        date                             trigger
0 2022-01-11  {'USA': 'event1', 'JPN': 'event2'}
1 2022-01-12  {'USA': 'event1', 'JPN': 'event2'}
2 2022-04-01                                 NaN
3 2022-05-01                                 NaN
  • Related