I have code that can do this, but I am iterating through each row of the dataframe with iterrows(). It takes quite a long time to process, considering it's checking over 6M rows, and I want to use vectorisation to speed it up. I've looked at using pd.Grouper and freq, but have gotten stuck on how to use the two dataframes to do this check with that.
Given the two dataframes below, I want to look at all rows in df1 (grouped by 'sid' and 'modtype'):
df1:
sid servid date modtype service
0 123 881 2022-07-05 A1 z
1 456 879 2022-07-02 A2 z
Then find those groups in df2 and count their occurrences within 3 days of that group's date in df1: one count of how many times the group occurs within 3 days before, and one count of how many times it occurs within 3 days after.
df2:
sid servid date modtype
0 123 1234 2022-07-03 A1
1 123 881 2022-07-05 A1
2 123 65781 2022-07-06 A1
3 123 8552 2022-07-30 A1
4 123 3453 2022-07-04 A2
5 123 5681 2022-07-07 A2
6 456 78 2022-07-01 A1
7 456 26744 2022-05-05 A2
8 456 56166 2022-06-29 A2
9 456 56717 2022-06-30 A2
10 456 879 2022-07-02 A2
11 456 56 2022-07-25 A2
So, essentially, in the sample set which I provide below, my output would end up with:
sid servid date modtype service cnt_3day_before cnt_3day_after
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 2 0
Sample set:
import pandas as pd

data1 = {
    'sid': ['123', '456'],
    'servid': ['881', '879'],
    'date': ['2022-07-05', '2022-07-02'],
    'modtype': ['A1', 'A2'],
    'service': ['z', 'z']}
df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df1 = df1.sort_values(by=['sid', 'modtype', 'date'], ascending=[True, True, True]).reset_index(drop=True)

data2 = {
    'sid': ['123', '123', '123', '123', '123', '123',
            '456', '456', '456', '456', '456', '456'],
    'servid': ['1234', '3453', '881', '65781', '5681', '8552',
               '26744', '56717', '879', '56166', '56', '78'],
    'date': ['2022-07-03', '2022-07-04', '2022-07-05', '2022-07-06', '2022-07-07', '2022-07-30',
             '2022-05-05', '2022-06-30', '2022-07-02', '2022-06-29', '2022-07-25', '2022-07-01'],
    'modtype': ['A1', 'A2', 'A1', 'A1', 'A2', 'A1',
                'A2', 'A2', 'A2', 'A2', 'A2', 'A1']}
df2 = pd.DataFrame(data2)
df2['date'] = pd.to_datetime(df2['date'])
df2 = df2.sort_values(by=['sid', 'modtype', 'date'], ascending=[True, True, True]).reset_index(drop=True)
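For reference, the slow row-by-row approach described at the top might look something like this (a hypothetical reconstruction, since the original loop isn't shown):

before_counts, after_counts = [], []
for _, row in df1.iterrows():
    # All df2 rows in the same (sid, modtype) group as this df1 row
    grp = df2[(df2['sid'] == row['sid']) & (df2['modtype'] == row['modtype'])]
    delta = grp['date'] - row['date']
    before_counts.append(((delta >= pd.Timedelta('-3D')) & (delta < pd.Timedelta(0))).sum())
    after_counts.append(((delta > pd.Timedelta(0)) & (delta <= pd.Timedelta('3D'))).sum())
df1['cnt_3day_before'] = before_counts
df1['cnt_3day_after'] = after_counts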
CodePudding user response:
I think there definitely exist faster solutions, but you can try this one. It iterates over the "queries" in df1 and, for each query, computes the number of events in df2 that happened within 3 days before and after. To count those events, we first set sid and modtype as the index, then select the matching events by index, compute the time difference between the selected events and the query, and count the ones that happened within +/- 3 days. This step could be optimized with a binary search, giving O(log N) instead of O(N) per query if the date column is sorted; a rough sketch of that idea follows the code below.
df2 = df2.set_index(['sid', 'modtype'])
seconds_in_3days = 3 * 24 * 60 * 60

def before_and_after_3days(query):
    # Select all df2 events in this query's (sid, modtype) group
    dates = df2.loc[tuple(query[['sid', 'modtype']]), 'date']
    # Signed difference in seconds between each event and the query date
    secs = (dates - query['date']).dt.total_seconds().astype(int)
    # Count events within 3 days before / after; both boundaries are
    # inclusive, matching the expected output where an event exactly
    # 3 days out still counts
    before = ((-seconds_in_3days <= secs) & (secs < 0)).sum()
    after = ((0 < secs) & (secs <= seconds_in_3days)).sum()
    return before, after

before_after = df1.apply(before_and_after_3days, axis=1)
df1[['cnt_3day_before', 'cnt_3day_after']] = before_after.tolist()
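And a minimal sketch of the binary-search optimization mentioned above (the helper name is hypothetical; it assumes numpy and sorts each group's dates before using np.searchsorted, and hasn't been benchmarked):

import numpy as np

def before_and_after_3days_bs(query):
    # Same group lookup as above; np.sort guards the sorted-input assumption
    dates = df2.loc[tuple(query[['sid', 'modtype']]), 'date']
    arr = np.sort(dates.to_numpy())
    t = query['date'].to_datetime64()
    d3 = np.timedelta64(3, 'D')
    # Elements in [t - 3d, t) count as "before"; elements in (t, t + 3d] as "after"
    before = np.searchsorted(arr, t, side='left') - np.searchsorted(arr, t - d3, side='left')
    after = np.searchsorted(arr, t + d3, side='right') - np.searchsorted(arr, t, side='right')
    return before, after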
CodePudding user response:
Here is a partial solution; I don't have time to do the full thing right now, but I thought I'd pass along the idea in case it helps you get moving in the right direction.
def a(x):
    # Rows of df2 with no matching df1 date have NaN in the merged df1 columns
    s = x['sid_y'].isna()
    if s.all():
        return pd.Series([0, 0], index=['before', 'after'])
    # Index of the first row in this group that matched a df1 query date
    idx = (~s).idxmax()
    nb_before = ((x.loc[idx, 'date'] > x['date']) & (x.loc[idx, 'date'] - x['date'] <= pd.Timedelta('3D'))).sum()
    nb_after = ((x.loc[idx, 'date'] < x['date']) & (x['date'] - x.loc[idx, 'date'] < pd.Timedelta('3D'))).sum()
    return pd.Series([nb_before, nb_after], index=['before', 'after'])

df2.merge(df1, how='left', on='date').groupby(['sid_x', 'modtype_x']).apply(a)
Result
before after
sid_x modtype_x
123 A1 1 1
A2 0 0
456 A1 0 0
A2 2 0
You'd have to work out the details, like renaming and merging back into whatever result dataframe you want. You'd also need to tweak the Timedelta comparison: what I have is inconsistent (nb_before uses <= while nb_after uses <), but you can probably take it from here, i.e. x['date'] - x.loc[idx,'date'] < pd.Timedelta('3D'). A possible consistent version is sketched below.
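For instance, making both windows inclusive at the exact 3-day boundary (which is what the question's expected counts imply) could look like:

# Hypothetical tweak: both boundaries inclusive at exactly 3 days
nb_before = ((x.loc[idx, 'date'] > x['date'])
             & (x.loc[idx, 'date'] - x['date'] <= pd.Timedelta('3D'))).sum()
nb_after = ((x.loc[idx, 'date'] < x['date'])
            & (x['date'] - x.loc[idx, 'date'] <= pd.Timedelta('3D'))).sum()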
CodePudding user response:
Annotated code
# Merge the dataframes on sid and modtype
keys = ['sid', 'modtype']
s = df2.merge(df1[[*keys, 'date']], on=keys, suffixes=['', '_'])
# Create boolean conditions as per the requirements
s['cnt_3day_after'] = s['date'].between(s['date_'], s['date_'] + pd.DateOffset(days=3), inclusive='right')
s['cnt_3day_before'] = s['date'].between(s['date_'] - pd.DateOffset(days=3), s['date_'], inclusive='left')
# group the boolean conditions by sid and modtype
# and aggregate with sum to count the number of True values
s = s.groupby(keys)[['cnt_3day_after', 'cnt_3day_before']].sum()
# Join the aggregated counts back with df1
df_out = df1.join(s, on=keys)
Result
print(df_out)
sid servid date modtype service cnt_3day_after cnt_3day_before
0 123 881 2022-07-05 A1 z 1 1
1 456 879 2022-07-02 A2 z 0 2
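If you want the column order shown in the question (before, then after), you can reorder at the end:

df_out = df_out[['sid', 'servid', 'date', 'modtype', 'service',
                 'cnt_3day_before', 'cnt_3day_after']]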