I'm trying to solve this issue. I have two dataframe. The first one looks like:
ID | start.date | end.date |
---|---|---|
272 | 2007-03-27 10:37:00 | 2007-03-27 15:09:00 |
290 | 2007-04-10 14:12:00 | 2007-04-10 15:51:00 |
268 | 2007-03-23 18:18:00 | 2007-03-23 18:24:00 |
264 | 2007-04-05 06:54:00 | 2007-04-09 06:45:00 |
105 | 2007-04-18 10:51:00 | 2007-04-18 13:37:00 |
280 | 2007-03-30 11:09:00 | 2007-04-02 06:27:00 |
99 | 2007-03-28 12:12:00 | 2007-03-28 15:22:00 |
268 | 2007-03-27 10:41:00 | 2007-03-27 10:54:00 |
263 | 2007-03-28 11:08:00 | 2007-03-28 12:45:00 |
264 | 2007-03-28 07:12:00 | 2007-03-28 11:08:00 |
While the second one looks like:
ID | date |
---|---|
266 | 2007-03-30 17:17:10 |
272 | 2007-03-30 14:23:39 |
268 | 2007-03-30 09:12:48 |
264 | 2007-03-30 18:57:57 |
276 | 2007-04-02 14:30:02 |
106 | 2007-03-28 11:35:49 |
276 | 2007-03-30 13:40:24 |
82 | 2007-03-27 17:29:28 |
104 | 2007-03-28 17:50:12 |
264 | 2007-03-29 14:41:16 |
I would like to add a column to the first dataframe with the count of the rows in the second dataframe with that ID and with a date value between the start.date and end.date of the first dataframe. How can I do it?
CodePudding user response:
You can try apply
on rows:
df1['start.date'] = pd.to_datetime(df1['start.date'])
df1['end.date'] = pd.to_datetime(df1['end.date'])
df2['date'] = pd.to_datetime(df2['date'])
df1['count'] = df1.apply(lambda row: (df2['date'].eq(row['ID']) & (row['start.date'] < df2['date']) & (df2['date'] < row['end.date'])).sum(), axis=1)
# or
df1['count2'] = df1.apply(lambda row: (df2['date'].eq(row['ID']) & df2['date'].between(row['start.date'], row['end.date'], inclusive='neither')).sum(), axis=1)
print(df1)
ID start.date end.date count count2
0 272 2007-03-27 10:37:00 2007-03-27 15:09:00 0 0
1 290 2007-04-10 14:12:00 2007-04-10 15:51:00 0 0
2 268 2007-03-23 18:18:00 2007-03-23 18:24:00 0 0
3 264 2007-04-05 06:54:00 2007-04-09 06:45:00 0 0
4 105 2007-04-18 10:51:00 2007-04-18 13:37:00 0 0
5 280 2007-03-30 11:09:00 2007-04-02 06:27:00 0 0
6 99 2007-03-28 12:12:00 2007-03-28 15:22:00 0 0
7 268 2007-03-27 10:41:00 2007-03-27 10:54:00 0 0
8 263 2007-03-28 11:08:00 2007-03-28 12:45:00 0 0
9 264 2007-03-28 07:12:00 2007-03-28 11:08:00 0 0
CodePudding user response:
Perfect job for numpy boardcasting:
id1, start_date, end_date = [df1[[col]].to_numpy() for col in ["ID", "start.date", "end.date"]]
id2, date = [df2[col].to_numpy() for col in ["ID", "date"]]
# Check every row in df1 against every row in df2 for our criteria:
# matching id, and date between start.date and end.date
match = (id1 == id2) & (start_date < date) & (date < end_date)
df1["count"] = match.sum(axis=1)