I have code that can do this, but I am iterating through each row of the dataframe with iterrows(). It takes quite a long time to process, considering it's checking over 6M rows, and I want to use vectorisation to speed it up. I've looked at using pd.Grouper and freq, but have gotten stuck on how to use the two dataframes to do this check with that.
Given the two dataframes below, I want to look at all rows in df1 (grouped by 'sid' and 'modtype'):
df1:
sid servid date modtype service
0 123 881 2022-07-05 A1 z
1 456 879 2022-07-02 A2 z
Then find those groups in df2 and count their occurrences within 3 days of that group's date in df1: one count of how many times the group occurs within 3 days before, and one count of how many times it occurs within 3 days after.
df2:
sid servid date modtype
0 123 1234 2022-07-03 A1
1 123 881 2022-07-05 A1
2 123 65781 2022-07-06 A1
3 123 8552 2022-07-30 A1
4 123 3453 2022-07-04 A2
5 123 5681 2022-07-07 A2
6 456 78 2022-07-01 A1
7 456 26744 2022-05-05 A2
8 456 56166 2022-06-29 A2
9 456 56717 2022-06-30 A2
10 456 879 2022-07-02 A2
11 456 56 2022-07-25 A2
So, essentially, in the sample set which I provide below, my output would end up with:
sid servid date modtype service cnt_3day_before cnt_3day_after
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 2 0
Sample set:
import pandas as pd

data1 = {
    'sid': ['123', '456'],
    'servid': ['881', '879'],
    'date': ['2022-07-05', '2022-07-02'],
    'modtype': ['A1', 'A2'],
    'service': ['z', 'z']}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.sort_values(by=['sid', 'modtype', 'date'], ascending=[True, True, True]).reset_index(drop=True)

data2 = {
    'sid': ['123', '123', '123', '123', '123', '123',
            '456', '456', '456', '456', '456', '456'],
    'servid': ['1234', '3453', '881', '65781', '5681', '8552',
               '26744', '56717', '879', '56166', '56', '78'],
    'date': ['2022-07-03', '2022-07-04', '2022-07-05', '2022-07-06', '2022-07-07', '2022-07-30',
             '2022-05-05', '2022-06-30', '2022-07-02', '2022-06-29', '2022-07-25', '2022-07-01'],
    'modtype': ['A1', 'A2', 'A1', 'A1', 'A2', 'A1',
                'A2', 'A2', 'A2', 'A2', 'A2', 'A1']}
df2 = pd.DataFrame(data2)
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.sort_values(by=['sid', 'modtype', 'date'], ascending=[True, True, True]).reset_index(drop=True)
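For reference, the slow row-by-row approach described at the top might look something like this (a hypothetical reconstruction, since the original loop isn't shown):

before_counts, after_counts = [], []
for _, row in df1.iterrows():
    # All df2 rows in the same (sid, modtype) group as this df1 row
    grp = df2[(df2['sid'] == row['sid']) & (df2['modtype'] == row['modtype'])]
    delta = grp['date'] - row['date']
    before_counts.append(((delta >= pd.Timedelta('-3D')) & (delta < pd.Timedelta(0))).sum())
    after_counts.append(((delta > pd.Timedelta(0)) & (delta <= pd.Timedelta('3D'))).sum())
df1['cnt_3day_before'] = before_counts
df1['cnt_3day_after'] = after_counts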
CodePudding user response:
I think there definitely exist faster solutions, but you can try this one. It iterates over the "queries" in df1 and, for each query, computes the number of events in df2 that happened within 3 days before and after. To count those events, we first set sid and modtype as the index, then select the matching events by index, compute the time difference between the selected events and the query, and count the ones that happened within +/- 3 days. This step could be optimized with a binary search, giving O(log N) instead of O(N) per query if the date column is sorted; a rough sketch of that idea follows the code below.
df2 = df2.set_index(['sid', 'modtype'])
seconds_in_3days = 3 * 24 * 60 * 60

def before_and_after_3days(query):
    # Select all df2 events in this query's (sid, modtype) group
    dates = df2.loc[tuple(query[['sid', 'modtype']]), 'date']
    # Signed difference in seconds between each event and the query date
    secs = (dates - query['date']).dt.total_seconds().astype(int)
    # Count events within 3 days before / after; both boundaries are
    # inclusive, matching the expected output where an event exactly
    # 3 days out still counts
    before = ((-seconds_in_3days <= secs) & (secs < 0)).sum()
    after = ((0 < secs) & (secs <= seconds_in_3days)).sum()
    return before, after

before_after = df1.apply(before_and_after_3days, axis=1)
df1[['cnt_3day_before', 'cnt_3day_after']] = before_after.tolist()
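And a minimal sketch of the binary-search optimization mentioned above (the helper name is hypothetical; it assumes numpy and sorts each group's dates before using np.searchsorted, and hasn't been benchmarked):

import numpy as np

def before_and_after_3days_bs(query):
    # Same group lookup as above; np.sort guards the sorted-input assumption
    dates = df2.loc[tuple(query[['sid', 'modtype']]), 'date']
    arr = np.sort(dates.to_numpy())
    t = query['date'].to_datetime64()
    d3 = np.timedelta64(3, 'D')
    # Elements in [t - 3d, t) count as "before"; elements in (t, t + 3d] as "after"
    before = np.searchsorted(arr, t, side='left') - np.searchsorted(arr, t - d3, side='left')
    after = np.searchsorted(arr, t + d3, side='right') - np.searchsorted(arr, t, side='right')
    return before, after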
CodePudding user response:
Here is a partial solution; I don't have time to do the full thing right now, but I thought I'd pass along the idea in case it helps you get moving in the right direction.
def a(x):
    # Rows of df2 with no matching df1 date have NaN in the merged df1 columns
    s = x['sid_y'].isna()
    if s.all():
        return pd.Series([0, 0], index=['before', 'after'])
    # Index of the first row in this group that matched a df1 query date
    idx = (~s).idxmax()
    nb_before = ((x.loc[idx, 'date'] > x['date']) & (x.loc[idx, 'date'] - x['date'] <= pd.Timedelta('3D'))).sum()
    nb_after = ((x.loc[idx, 'date'] < x['date']) & (x['date'] - x.loc[idx, 'date'] < pd.Timedelta('3D'))).sum()
    return pd.Series([nb_before, nb_after], index=['before', 'after'])

df2.merge(df1, how='left', on='date').groupby(['sid_x', 'modtype_x']).apply(a)
Result
before after
sid_x modtype_x
123 A1 1 1
A2 0 0
456 A1 0 0
A2 2 0
You'd have to work out the details, like renaming and merging back into whatever result dataframe you want. You'd also need to tweak the Timedelta comparison: what I have is inconsistent (nb_before uses <= while nb_after uses <), but you can probably take it from here, i.e. x['date'] - x.loc[idx,'date'] < pd.Timedelta('3D'). A possible consistent version is sketched below.
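For instance, making both windows inclusive at the exact 3-day boundary (which is what the question's expected counts imply) could look like:

# Hypothetical tweak: both boundaries inclusive at exactly 3 days
nb_before = ((x.loc[idx, 'date'] > x['date'])
             & (x.loc[idx, 'date'] - x['date'] <= pd.Timedelta('3D'))).sum()
nb_after = ((x.loc[idx, 'date'] < x['date'])
            & (x['date'] - x.loc[idx, 'date'] <= pd.Timedelta('3D'))).sum()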
CodePudding user response:
Annotated code
# Merge the dataframes on sid and modtype
keys = ['sid', 'modtype']
s = df2.merge(df1[[*keys, 'date']], on=keys, suffixes=['', '_'])
# Create boolean conditions as per the requirements
s['cnt_3day_after'] = s['date'].between(s['date_'], s['date_'] + pd.DateOffset(days=3), inclusive='right')
s['cnt_3day_before'] = s['date'].between(s['date_'] - pd.DateOffset(days=3), s['date_'], inclusive='left')
# group the boolean conditions by sid and modtype
# and aggregate with sum to count the number of True values
s = s.groupby(keys)[['cnt_3day_after', 'cnt_3day_before']].sum()
# Join the aggregated counts back with df1
df_out = df1.join(s, on=keys)
Result
print(df_out)
sid servid date modtype service cnt_3day_after cnt_3day_before
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 0 2
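If you want the column order shown in the question (before, then after), you can reorder at the end:

df_out = df_out[['sid', 'servid', 'date', 'modtype', 'service',
                 'cnt_3day_before', 'cnt_3day_after']]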