If I have a pandas DataFrame like this
date | person_active |
---|---|
22/2 | John |
22/2 | Marie |
22/2 | Mark |
23/2 | John |
24/2 | Mark |
24/2 | Marie |
how do I count the unique values in person_active over a time-based rolling window? For example, with a 2-day rolling window it should end up like this:
date | person_active | people_active |
---|---|---|
22/2 | John | 3 |
22/2 | Marie | 3 |
22/2 | Mark | 3 |
23/2 | John | 3 |
24/2 | Mark | 3 |
24/2 | Marie | 3 |
The main issue here is that I have duplicate entries on date (one row per active person), so a simple df.rolling('2d', on='date').count() won't do the job.
EDIT: Please consider how the solution scales on a big dataset. Ideally it should be usable in a real-world environment, so if it takes too long to compute, it's not that useful.
CodePudding user response:
IIUC, try:
import pandas as pd
# convert to datetime if needed (the format has no year, so pandas defaults to 1900)
df["date"] = pd.to_datetime(df["date"], format="%d/%m")
# convert the string names to categorical codes for numerical aggregation
df["people"] = pd.Categorical(df["person_active"]).codes
# compute the rolling unique count, then take the per-date maximum
# so that every row of a given date carries the full count
df["people_active"] = (df.rolling("2D", on="date")["people"]
                         .agg(lambda x: x.nunique())
                         .groupby(df["date"])
                         .transform("max")
                      )
# drop the unnecessary helper column
df = df.drop("people", axis=1)
>>> df
date person_active people_active
0 1900-02-22 John 3.0
1 1900-02-22 Marie 3.0
2 1900-02-22 Mark 3.0
3 1900-02-23 John 3.0
4 1900-02-24 Mark 3.0
5 1900-02-24 Marie 3.0
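On the scaling concern from the question: the lambda above receives a freshly built Series for every window. A possible variant, sketched here and not benchmarked, passes raw=True so each window arrives as a plain NumPy array, which is typically cheaper. The sample frame and the year 2022 are assumptions made only to keep the snippet self-contained, and the rows are assumed to be sorted by date already.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the year 2022 is an assumption, since the
# original dates carry only day and month. Rows must be sorted by date.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-02-22", "2022-02-22", "2022-02-22",
         "2022-02-23", "2022-02-24", "2022-02-24"]
    ),
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})

# Integer code per person, so the rolling window only sees numbers.
df["people"] = pd.Categorical(df["person_active"]).codes

# raw=True hands each window to the function as a NumPy array,
# avoiding the construction of a Series per window.
rolled = (
    df.rolling("2D", on="date")["people"]
    .apply(lambda window: np.unique(window).size, raw=True)
)

# Rows sharing a date see windows of different lengths, so take the
# per-date maximum to give every row of that date the full count.
df["people_active"] = rolled.groupby(df["date"]).transform("max").astype(int)
df = df.drop(columns="people")
```

The function is still called once per row, so this only trims the per-call overhead; it does not change how the work grows with the number of rows.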
CodePudding user response:
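Group by date, count the unique values per day, then take a rolling sum (date must already be a datetime column). Note that this adds up per-day counts, so a person active on both days of a window is counted twice; on the sample data it yields 3, 4, 3 instead of 3, 3, 3: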
df.groupby('date').nunique().rolling('2d').sum()
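One way to keep the group-by-date idea while counting each person only once per window is to build a date-by-person presence table and take a rolling maximum instead of a sum. This is a minimal sketch, not a tuned implementation: the names presence and per_day are hypothetical, and the year 2022 is an assumption added so the dates parse unambiguously.

```python
import pandas as pd

# Sample data from the question; the year 2022 is an assumption.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-02-22", "2022-02-22", "2022-02-22",
         "2022-02-23", "2022-02-24", "2022-02-24"]
    ),
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})

# One row per date, one column per person, 1 where the person was active.
presence = (
    df.groupby(["date", "person_active"]).size()
    .unstack(fill_value=0)
    .clip(upper=1)
)

# A person counts for a window if they were present on any day inside it,
# so take the rolling max per person and then sum across people.
per_day = presence.rolling("2D").max().sum(axis=1).astype(int)

# Broadcast the per-date result back onto the original rows.
df["people_active"] = df["date"].map(per_day)
```

This avoids a Python-level function call per row, but the presence table has one cell per (date, person) pair, so memory grows with the number of distinct people times the number of distinct dates.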
CodePudding user response:
Since rolling doesn't work for string variables, we could assign a numerical value to each person with groupby + ngroup, then use rolling + apply(nunique) on the new numerical column:
df['date'] = pd.to_datetime(df['date'], format='%d/%m')
df['people_active'] = (df.assign(group=df.groupby('person_active').ngroup())
.rolling('2d', on='date')['group'].apply(lambda x: x.nunique())
.groupby(df["date"]).transform('max'))
df['date'] = df['date'].dt.strftime('%d/%m')
Output:
date person_active people_active
0 22/02 John 3.0
1 22/02 Marie 3.0
2 22/02 Mark 3.0
3 23/02 John 3.0
4 24/02 Mark 3.0
5 24/02 Marie 3.0