How to get users with 3 or more consecutive weeks in order using pandas?-CodePudding

I have a user table like this,

    USERID  Week_Number  Year
0       fb          5.0  2021
1  twitter          1.0  2021
2  twitter          2.0  2021
3  twitter          3.0  2021
4  twitter          1.0  2022
5  twitter          2.0  2022
6  twitter          3.0  2022
7  twitter         15.0  2022
8  twitter          NaN   NaN
9    human         21.0  2022

I want to find the users who login >= 3 consecutive weeks in the same year. The week numbers will be unique for each year. For example, in the above table we can see that user twitter is logged in week_no: 1, 2, 3 in the same year 2022 thereby satisfying the condition that I am looking for.

The output I am looking for,

USERID        Year
twitter       2021
twitter       2022

You can create the sample table using,

import pandas as pd
import numpy as np

data = pd.DataFrame({"USERID": ["fb", "twitter", "twitter", "twitter", "twitter", "twitter", "twitter", "twitter", "twitter", "human"],
                     "Week_Number": [5, 1, 2, 3, 1, 2, 3, 15, np.nan, 21],
                     "Year": ["2021", "2021","2021","2021", "2022", "2022", "2022", "2022", np.nan, "2022"]})

Can someone help me achieve this required output? I have tried few things but not able to arrive at proper output.

for ix, group in data.groupby([data.USERID, data.Year]):
    group = group.sort_values("Week_Number")
    group["Diff"] = (group.Week_Number - group.Week_Number.shift(1)).fillna(1)
    break

Thank you for any help in advance.

CodePudding user response：

Since you are not interested in the details of when the run of at least three weeks occurred (start or end), but only the tuples (user, year) where the user had at least three consecutive weeks of usage, then it is quite simple:

def min_consecutive(w, minimum_run=3):
    dy = w.diff() != 1
    runlen = dy.groupby(dy.cumsum()).size()
    return (runlen >= minimum_run).any()

s = (
    data
    .sort_values('Week_Number')
    .groupby(['USERID', 'Year'])['Week_Number']
    .apply(min_consecutive)
)
>>> s[s]
USERID   Year
twitter  2021    True
         2022    True
Name: Week_Number, dtype: bool

Explanation

We consider each group (user, year). In that group, we observe an (ordered, without repeats) series of week numerals. This could be e.g. [1,2,3,12,13,18,19,20,21] (a run of 3, a run of 2, and a run of 4). The Series dy shows where there were gaps in the run (e.g. for the hypothetical value above: [T,F,F,T,F,T,F,F,F]). We use the .cumsum() of that to make groups that are each a consecutive run, e.g. [1,1,1,2,2,3,3,3,3]. We take the size of each group (e.g. [3,2,4]), and return True iff any of those is at least minimum_run long.

Addendum: locate the weeks that meet the criteria

Here are some ideas, depending on how you'd like your output.

df = data.dropna().sort_values(['USERID', 'Year', 'Week_Number'])
df = df.assign(rungrp=(df.groupby(['USERID', 'Year'])['Week_Number'].diff() != 1).cumsum())
df = df.loc[df.groupby('rungrp')['rungrp'].transform('count') >= 3]
>>> df
    USERID  Week_Number  Year  rungrp
1  twitter          1.0  2021       3
2  twitter          2.0  2021       3
3  twitter          3.0  2021       3
4  twitter          1.0  2022       4
5  twitter          2.0  2022       4
6  twitter          3.0  2022       4

All the weeks that are part of a run of at least 3.

Grouping to find week min and max of each run:

>>> df.groupby(['USERID', 'Year', 'rungrp'])['Week_Number'].agg([min, max])
                     min  max
USERID  Year rungrp          
twitter 2021 3       1.0  3.0
        2022 4       1.0  3.0

CodePudding user response：

Instead of looping you can create a column which will show whether a user in a year has had a consecutive increase, and then check if that column sums to more than 3 per user in a year:

data.sort_values(by=['USERID','Year','Week_Number'],ascending=True,inplace=True)

data.assign(
    grouped_increase = data.groupby([data.USERID, data.Year])["Week_Number"]
    .diff()
    .gt(0)
    .astype(int)
).groupby([data.USERID, data.Year])["grouped_increase"].sum().reset_index().query(
    "grouped_increase >= 3"
).drop(
    "grouped_increase", axis=1
)

    USERID  Year
3  twitter  2022

Based on your comment, using this DF:

     USERID  Week_Number    Year
8        fb          2.0  2021.0
9        fb          3.0  2021.0
10       fb          4.0  2021.0
0        fb          5.0  2021.0
11       fb          2.0  2022.0
12       fb          3.0  2022.0
13       fb          4.0  2022.0
14       fb          5.0  2022.0
7     human         21.0  2022.0
1   twitter          1.0  2021.0
2   twitter          1.0  2022.0
3   twitter          2.0  2022.0
4   twitter          3.0  2022.0
5   twitter         15.0  2022.0
6   twitter          NaN     NaN

Running the above code gives:

    USERID    Year
0       fb  2021.0
1       fb  2022.0
4  twitter  2022.0