Home > Software engineering >  Filling dataframe column with calculated value from a different dataframe
Filling dataframe column with calculated value from a different dataframe

Time:10-29

I have two dataframes. The first contains a series of ID numbers with two dates, and the second contains a series of ID numbers with a date for an event.

import pandas as pd
df1 = pd.DataFrame({'id': ['19403', '59208', '45290'], 'date_1': ['2021-01-04', '2021-01-05', '2021-05-08'], 'date_2': ['2021-03-07', '2021-01-12', '2021-06-15']})
df2 = pd.DataFrame({'id': ['19403', '19403', '19403', '59208', '59208', '45290', '45290', '45290', '45290', '45290', '45290', '45290'], 'date': ['2021-01-05', '2021-02-12', '2021-03-17', '2021-01-06', '2021-01-08', '2021-03-14', '2021-03-17', '2021-05-09', '2021-05-12', '2021-05-23', '2021-06-01', '2021-08-14' ]})

I'm trying to merge the two dataframes so that for every row in the first dataframe, it counts the number of events in the second where the date is between the two dates in the first and the ID number matches.

output = pd.DataFrame({'id': ['19403', '59208', '45290'], 'date_1': ['2021-01-04', '2021-01-05', '2021-05-08'], 'date_2': ['2021-03-07', '2021-01-12', '2021-06-15'], 'count': [2, 2, 4]})

I've been able to mock it up one row at a time (below), but I'm struggling to apply it to the whole dataframe.

count = df2[(df2['id'] == '19403') & (df2['date'].between('2021-01-04', '2021-03-07'))]['date'].count()

Any suggestions? I'm probably missing something obvious here.

CodePudding user response:

(df1.merge(df2, on='id', how='outer')
    .query("(date_1 <= date <= date_2)")
    .groupby(['id', 'date_1', 'date_2'], as_index=False)
    .size())

Output:

    id      date_1      date_2    size
0   19403   2021-01-04  2021-03-07  2
1   45290   2021-05-08  2021-06-15  4
2   59208   2021-01-05  2021-01-12  2
  • Related