Pandas: Rolling window to count the frequency - Fastest approach


I would like to count the frequency of a value over the past x days. In the example below, I would like to count how often each value in the Name column has appeared over the past 28 days. The data is already sorted by Date.

import pandas as pd
import time

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021',
              '27/01/2021',
              '05/02/2021',
              '10/02/2021',
              '17/02/2021',
              '18/02/2021',
              '20/02/2021',
              '21/02/2021',
              '22/02/2021',
              '29/03/2021']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')
# Make sure the DataFrame is sorted by Date
df = df.sort_values('Date')

I found some solutions on Stack Overflow, but none of them is both correct on this dataset and fast.

Approach 1 - not quite correct

df['count1'] = df.set_index('Date').groupby('Name', sort=False)['Name'].rolling('28d', closed='both').count().tolist()

Approach 2 - correct, but very slow (from this link)

df['count2'] = df.assign(count=1).groupby(['Name']).apply(lambda x: x.rolling('28d', on='Date').sum())['count']

Approach 3 - using sum - not correct

df['count3'] = df.assign(count=1).groupby('Name').rolling('28d', on='Date').sum().reset_index().sort_values('Date')['count']

Approach 4 - also using sum - not correct, as the indexes are not right (from this link)

df['count4'] = df.set_index('Date').assign(count_last=1).groupby('Name').rolling('28d').sum().reset_index()["count_last"]

Output

   Name       Date  count1  count2  count3  count4
0  Jack 2021-01-08     1.0     1.0     1.0     1.0
1   Jim 2021-01-27     2.0     1.0     1.0     1.0
2  Jack 2021-02-05     2.0     1.0     2.0     2.0
3   Jim 2021-02-10     3.0     2.0     3.0     3.0
4  Jack 2021-02-17     4.0     2.0     4.0     4.0  # <- all are wrong here except Approach 2
5  Jack 2021-02-18     1.0     3.0     1.0     1.0
6   Jim 2021-02-20     2.0     3.0     1.0     1.0
7  Jack 2021-02-21     3.0     4.0     1.0     1.0
8  Jane 2021-02-22     1.0     1.0     2.0     2.0
9  Jane 2021-03-29     1.0     1.0     3.0     3.0

Performance

Approach 1: 0.0014538764953613281 ms
Approach 2: 0.0034720897674560547 ms
Approach 3: 0.002077817916870117 ms
Approach 4: 0.0035729408264160156 ms
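
The timing code isn't shown above; a minimal sketch of how one approach can be measured with time.perf_counter (the exact harness used may differ):

import time

# Wall-clock Approach 2 (the slow-but-correct one) with a monotonic timer
start = time.perf_counter()
count2 = (df.assign(count=1)
            .groupby(['Name'])
            .apply(lambda x: x.rolling('28d', on='Date').sum())['count'])
print(f"Approach 2: {time.perf_counter() - start:.6f} s")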

CodePudding user response:

IIUC, the issue comes from your tolist(), which breaks the index alignment and shuffles the output.
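
You can see the shuffle by printing the index of the intermediate rolling result (a quick illustration):

rolled = (df.set_index('Date')
            .groupby('Name', sort=False)['Name']
            .rolling('28d', closed='both')
            .count())
print(rolled.index)
# MultiIndex of (Name, Date) pairs in group order: all Jack rows first,
# then Jim, then Jane -- not the original 0..9 row order, so .tolist()
# plus positional assignment puts the counts on the wrong rows.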

Use a merge instead:

df2 = (df
 .merge(df.set_index('Date')
          .groupby('Name', sort=False)['Name']
          .rolling('28d', closed='both') # do you really want closed="both"?
          .count().rename('count'),
        left_on=['Name', 'Date'], right_index=True
        )
)

output:

   Name       Date  count
0  Jack 2021-01-08    1.0
1   Jim 2021-01-27    1.0
2  Jack 2021-02-05    2.0 <- if you want 1 here, remove closed='both'
3   Jim 2021-02-10    2.0
4  Jack 2021-02-17    2.0
5  Jack 2021-02-18    3.0
6   Jim 2021-02-20    3.0
7  Jack 2021-02-21    4.0
8  Jane 2021-02-22    1.0
9  Jane 2021-03-29    1.0
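
If you want 1 on the 2021-02-05 row instead, drop closed='both' and keep pandas' default closed='right', which excludes a row that is exactly 28 days old; on this data that reproduces the count2 column from Approach 2:

# Same merge with the default closed='right' window: Jack's 2021-01-08 row
# is no longer counted on 2021-02-05 (exactly 28 days later).
df3 = (df
 .merge(df.set_index('Date')
          .groupby('Name', sort=False)['Name']
          .rolling('28d')
          .count().rename('count'),
        left_on=['Name', 'Date'], right_index=True
        )
)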

CodePudding user response:

Solution with DataFrame.join for appending a Series with a MultiIndex:

df = df.join(df.set_index('Date')
               .groupby('Name')['Name']
               .rolling('28d', closed='both')
               .count().rename('count'),
             on=['Name', 'Date'])
print(df)
   Name       Date  count
0  Jack 2021-01-08    1.0
1   Jim 2021-01-27    1.0
2  Jack 2021-02-05    2.0
3   Jim 2021-02-10    2.0
4  Jack 2021-02-17    2.0
5  Jack 2021-02-18    3.0
6   Jim 2021-02-20    3.0
7  Jack 2021-02-21    4.0
8  Jane 2021-02-22    1.0
9  Jane 2021-03-29    1.0
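
Small optional follow-up: the rolling count comes back as float64, and since every (Name, Date) pair finds a match here, it can be cast back to an integer dtype:

# No NaNs are introduced by the join on this data, so an integer column is safe
df['count'] = df['count'].astype(int)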