I have the following dataframe (this is a simplified version of the dataframe, but the logic is the same):
#MONTH = yyyy-mm-dd
MONTH User
0 2021-04-01 A
1 2021-04-01 B
2 2021-05-01 B
3 2021-06-01 A
4 2021-06-01 B
5 2021-07-01 A
6 2021-07-01 B
7 2021-08-01 A
8 2021-08-01 B
What I want is to compute whether a user was active on a 3 month rolling basis.
For example, User B
if we consider June (2021-06-01) we can see he was active in May and in April, thus on a 3M rolling basis he is considered Active in June. Whereas User A
for the same time period, was not active in one of the three months, thus in June he will not be considered active.
A desired output would be to have a column which counts the Active Users (3m rolling) for each month, for example based on the above data:
MONTH Active_User_Count
0 2021-04-01 NaN
1 2021-05-01 NaN
2 2021-06-01 1
3 2021-07-01 1
4 2021-08-01 2
I'm still trying to get my head around rolling data, so if anyone could help me on this that would be great! Thanks in advance!
EDIT The MONTH
column only has values for the 1st day of each month, but has multiple Users for that day. So there is no 2021-04-30, it all is on a monthly basis on the first day of the month.
CodePudding user response:
Aight, let's try this.
Assuming a pandas.DataFrame
called df
, that has a MONTH
column of type pandas.Timestamp
, and a User
column that we can groupby
:
import pandas as pd
import numpy as np
df = #[however you got your data here]
df.MONTH = df.MONTH.apply(pd.Timestamp)
so that e.g.
>>> df
MONTH User
0 2021-04-01 A
1 2021-04-01 B
2 2021-05-01 B
3 2021-06-01 A
4 2021-06-01 B
5 2021-07-01 A
6 2021-07-01 B
7 2021-08-01 A
8 2021-08-01 B
Then given the above, let's make a DataFrame to hold our result, with consecutive months from the beginning to the end of the input DataFrame
, and initialize the active user count column to 0:
res = pd.DataFrame(pd.date_range(df.MONTH.min(),df.MONTH.max(),freq='MS'),columns=['MONTH'])
res['Active_User_Count'] = 0
res = res.set_index('MONTH').sort_index()
Now to add in the values:
for user, frame in df.groupby(by='User'):
# make a helper column, that has an indicator of whether the user
# was active that month (value='both') or not (value='right_only')
frame = frame.merge(
pd.Series(pd.date_range(start=frame.MONTH.min(),\
end=frame.MONTH.max(),\
freq='MS'),\
name='MONTH'),\
on='MONTH',how='outer',indicator=True)\
.set_index('MONTH').sort_index()
# this is where the magic happens;
# categorize the '_merge' results (0 = left_only, 1 = right_only, 2 = both)
# then on a 3-wide rolling window, get the minimum value
# check that it is greater than 1.5 (i.e. all three prev months
# are _merge value 'both')
# if it's not > 1.5, then the user wasn't active for all 3 months
# finally take the result from that rolling.min.apply,
# and funnel into a numpy.where array, which sets
# 'Active_User_Count' of the in-process user frame
# to an array of 1s and 0s
frame['Active_User_Count'] = np.where(
(frame._merge
.astype('category').cat.codes
.rolling(3).min().apply(lambda x: x > 1.5)), 1, 0)
# add the current-user activity into the total result
res.Active_User_Count[frame.index] = frame.Active_User_Count
# some re-formatting
res = res.reset_index().sort_index()
And after all that we get our output:
>>> res
MONTH Active_User_Count
0 2021-04-01 0
1 2021-05-01 0
2 2021-06-01 1
3 2021-07-01 1
4 2021-08-01 2
TL;DR
Here's a function to do the thing
import pandas as pd
import numpy as np
def active_users(df):
res = pd.DataFrame(pd.date_range(df.MONTH.min(),\
df.MONTH.max(),\
freq='MS'),\
columns=['MONTH'])
res['Active_User_Count'] = 0
res = res.set_index('MONTH').sort_index()
for user, frame in df.groupby(by='User'):
frame = frame.merge(pd.Series(
pd.date_range(start=frame.MONTH.min(),\
end=frame.MONTH.max(),\
freq='MS'),\
name='MONTH'),\
on='MONTH',\
how='outer',\
indicator=True)\
.set_index('MONTH').sort_index()
frame['Active_User_Count'] = np.where(
(frame._merge
.astype('category')
.cat.codes
.rolling(3).min().apply(lambda x: x > 1.5)), 1, 0)
res.Active_User_Count[frame.index] = frame.Active_User_Count
return res.reset_index().sort_index()