Count rolling window unique values with duplicate dates in pandas

If I have a pandas DataFrame like this

date	person_active
22/2	John
22/2	Marie
22/2	Mark
23/2	John
24/2	Mark
24/2	Marie

how do I count the unique values in person_active over a time-based rolling window? For example, with a 2-day rolling window it should end up like this:

date	person_active	people_active
22/2	John	3
22/2	Marie	3
22/2	Mark	3
23/2	John	3
24/2	Mark	3
24/2	Marie	3

The main issue here is that each date appears on several rows (one per active person), so a simple df.rolling('2d', on='date').count() won't do the job: it counts rows, not distinct people.
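To make the problem concrete, here is a minimal sketch that rebuilds the sample frame and runs a plain rolling count. The `row` helper column and the `row_counts` name are made up for illustration; the dates carry no year, so pandas fills in 1900.

```python
import pandas as pd

# Sample data from the question (no year given, so parsing defaults to 1900).
df = pd.DataFrame({
    "date": ["22/2", "22/2", "22/2", "23/2", "24/2", "24/2"],
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m")

# Hypothetical numeric helper column so there is something to count.
df["row"] = 1

# Counts the rows that fall in each 2-day window, not the distinct people.
row_counts = df.rolling("2D", on="date")["row"].count()
print(row_counts.tolist())
```

The 23/2 row, for instance, sees four rows in its window because John appears on both 22/2 and 23/2, while the desired answer is 3.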

EDIT: Please consider how the implementation would scale on a big dataset. Ideally the solution should be usable in a real-world environment, so if it takes too long to compute it's not that useful.

CodePudding user response：

IIUC, try:

import pandas as pd

#convert to datetime if needed
df["date"] = pd.to_datetime(df["date"], format="%d/%m")

#convert string name to categorical codes for numerical aggregation
df["people"] = pd.Categorical(df["person_active"]).codes

#compute the rolling unique count; rows sharing a date only see the rows
#up to themselves, so take the max per date to give them all the full count
df["people_active"] = (df.rolling("2D", on="date")["people"]
                         .agg(lambda x: x.nunique())
                         .groupby(df["date"])
                         .transform("max")
                       )

#drop the unnecessary column
df = df.drop("people", axis=1)

>>> df
        date person_active  people_active
0 1900-02-22          John            3.0
1 1900-02-22         Marie            3.0
2 1900-02-22          Mark            3.0
3 1900-02-23          John            3.0
4 1900-02-24          Mark            3.0
5 1900-02-24         Marie            3.0
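
On the scaling concern from the question: the lambda above is called once per row and receives a new Series each time. A possible variation (a sketch, not part of the answer above, and not benchmarked here) is to pass raw=True so each window arrives as a NumPy array and count with np.unique. It is still one Python call per row, so treat it as a way to trim overhead rather than a fully vectorized solution. The names `codes`, `per_row` and `people_active` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["22/2", "22/2", "22/2", "23/2", "24/2", "24/2"],
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m")

# Integer codes stand in for the names, as in the answer above.
codes = pd.Series(pd.Categorical(df["person_active"]).codes, index=df.index)

# raw=True hands each window to the function as a NumPy array.
per_row = (
    df.assign(people=codes)
      .rolling("2D", on="date")["people"]
      .apply(lambda window: np.unique(window).size, raw=True)
)

# Rows sharing a date only see rows up to themselves, so take the per-date max.
people_active = per_row.groupby(df["date"]).transform("max")
print(people_active.tolist())
```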

CodePudding user response：

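Group by date, count the unique values per day, and then sum those counts over the rolling window. This needs date to be a datetime column, and it counts a person once for every day they appear in the window, so it only matches the distinct count when nobody is active on more than one day of the window (in the sample it gives 4 for 23/2, because John is active on both 22/2 and 23/2):

df.groupby('date').nunique().rolling('2d').sum()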
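
Summing per-day unique counts counts a person once per day, so someone active on two days of the window is counted twice. One way to keep the group-then-roll idea but get the distinct count is to roll over a date-by-person presence table instead. This is a sketch under assumptions: `presence`, `active_in_window` and `per_date` are illustrative names, and the table is dense, so memory grows with the number of dates times the number of distinct people.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["22/2", "22/2", "22/2", "23/2", "24/2", "24/2"],
    "person_active": ["John", "Marie", "Mark", "John", "Mark", "Marie"],
})
df["date"] = pd.to_datetime(df["date"], format="%d/%m")

# One row per date, one column per person; a cell holds that day's row count.
presence = pd.crosstab(df["date"], df["person_active"])

# A person is active in the window if they appear on any day inside it.
active_in_window = presence.rolling("2D").sum().gt(0)

# Distinct people per date, mapped back onto the original rows.
per_date = active_in_window.sum(axis=1)
df["people_active"] = df["date"].map(per_date)
print(df["people_active"].tolist())
```

All of the work here happens in built-in pandas operations rather than a Python function called per row, which is the part most likely to matter on a large frame.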

CodePudding user response：

Since rolling doesn't work for string variables, we could assign a numerical value to each person with groupby ngroup; then use rolling apply(nunique) on the new numerical column:

df['date'] = pd.to_datetime(df['date'], format='%d/%m')
df['people_active'] = (df.assign(group=df.groupby('person_active').ngroup())
                       .rolling('2d', on='date')['group'].apply(lambda x: x.nunique())
                       .groupby(df["date"]).transform('max'))
df['date'] = df['date'].dt.strftime('%d/%m')

Output:

    date person_active  people_active
0  22/02          John            3.0
1  22/02         Marie            3.0
2  22/02          Mark            3.0
3  23/02          John            3.0
4  24/02          Mark            3.0
5  24/02         Marie            3.0