Home > Back-end >  Pandas count observations in one dataframe conditionally on values of other dataframe
Pandas count observations in one dataframe conditionally on values of other dataframe

Time:05-02

I'm trying to solve this issue. I have two dataframe. The first one looks like:

ID start.date end.date
272 2007-03-27 10:37:00 2007-03-27 15:09:00
290 2007-04-10 14:12:00 2007-04-10 15:51:00
268 2007-03-23 18:18:00 2007-03-23 18:24:00
264 2007-04-05 06:54:00 2007-04-09 06:45:00
105 2007-04-18 10:51:00 2007-04-18 13:37:00
280 2007-03-30 11:09:00 2007-04-02 06:27:00
99 2007-03-28 12:12:00 2007-03-28 15:22:00
268 2007-03-27 10:41:00 2007-03-27 10:54:00
263 2007-03-28 11:08:00 2007-03-28 12:45:00
264 2007-03-28 07:12:00 2007-03-28 11:08:00

While the second one looks like:

ID date
266 2007-03-30 17:17:10
272 2007-03-30 14:23:39
268 2007-03-30 09:12:48
264 2007-03-30 18:57:57
276 2007-04-02 14:30:02
106 2007-03-28 11:35:49
276 2007-03-30 13:40:24
82 2007-03-27 17:29:28
104 2007-03-28 17:50:12
264 2007-03-29 14:41:16

I would like to add a column to the first dataframe with the count of the rows in the second dataframe with that ID and with a date value between the start.date and end.date of the first dataframe. How can I do it?

CodePudding user response:

You can try apply on rows:

df1['start.date'] = pd.to_datetime(df1['start.date'])
df1['end.date'] = pd.to_datetime(df1['end.date'])
df2['date'] = pd.to_datetime(df2['date'])


df1['count'] = df1.apply(lambda row: (df2['date'].eq(row['ID']) & (row['start.date'] < df2['date']) & (df2['date'] < row['end.date'])).sum(), axis=1)

# or

df1['count2'] = df1.apply(lambda row: (df2['date'].eq(row['ID']) & df2['date'].between(row['start.date'], row['end.date'], inclusive='neither')).sum(), axis=1)
print(df1)

    ID          start.date            end.date  count  count2
0  272 2007-03-27 10:37:00 2007-03-27 15:09:00      0       0
1  290 2007-04-10 14:12:00 2007-04-10 15:51:00      0       0
2  268 2007-03-23 18:18:00 2007-03-23 18:24:00      0       0
3  264 2007-04-05 06:54:00 2007-04-09 06:45:00      0       0
4  105 2007-04-18 10:51:00 2007-04-18 13:37:00      0       0
5  280 2007-03-30 11:09:00 2007-04-02 06:27:00      0       0
6   99 2007-03-28 12:12:00 2007-03-28 15:22:00      0       0
7  268 2007-03-27 10:41:00 2007-03-27 10:54:00      0       0
8  263 2007-03-28 11:08:00 2007-03-28 12:45:00      0       0
9  264 2007-03-28 07:12:00 2007-03-28 11:08:00      0       0

CodePudding user response:

Perfect job for numpy boardcasting:

id1, start_date, end_date = [df1[[col]].to_numpy() for col in ["ID", "start.date", "end.date"]]
id2, date = [df2[col].to_numpy() for col in ["ID", "date"]]

# Check every row in df1 against every row in df2 for our criteria:
# matching id, and date between start.date and end.date
match = (id1 == id2) & (start_date < date) & (date < end_date)
df1["count"] = match.sum(axis=1)
  • Related