I have a user table like this,
USERID Week_Number Year
0 fb 5.0 2021
1 twitter 1.0 2021
2 twitter 2.0 2021
3 twitter 3.0 2021
4 twitter 1.0 2022
5 twitter 2.0 2022
6 twitter 3.0 2022
7 twitter 15.0 2022
8 twitter NaN NaN
9 human 21.0 2022
I want to find the users who login >= 3 consecutive weeks in the same year. The week numbers will be unique for each year. For example, in the above table we can see that user twitter is logged in week_no: 1, 2, 3 in the same year 2022 thereby satisfying the condition that I am looking for.
The output I am looking for,
USERID Year
twitter 2021
twitter 2022
You can create the sample table using,
import pandas as pd
import numpy as np
data = pd.DataFrame({"USERID": ["fb", "twitter", "twitter", "twitter", "twitter", "twitter", "twitter", "twitter", "twitter", "human"],
"Week_Number": [5, 1, 2, 3, 1, 2, 3, 15, np.nan, 21],
"Year": ["2021", "2021","2021","2021", "2022", "2022", "2022", "2022", np.nan, "2022"]})
Can someone help me achieve this required output? I have tried few things but not able to arrive at proper output.
for ix, group in data.groupby([data.USERID, data.Year]):
group = group.sort_values("Week_Number")
group["Diff"] = (group.Week_Number - group.Week_Number.shift(1)).fillna(1)
break
Thank you for any help in advance.
CodePudding user response:
Since you are not interested in the details of when the run of at least three weeks occurred (start or end), but only the tuples (user, year)
where the user had at least three consecutive weeks of usage, then it is quite simple:
def min_consecutive(w, minimum_run=3):
dy = w.diff() != 1
runlen = dy.groupby(dy.cumsum()).size()
return (runlen >= minimum_run).any()
s = (
data
.sort_values('Week_Number')
.groupby(['USERID', 'Year'])['Week_Number']
.apply(min_consecutive)
)
>>> s[s]
USERID Year
twitter 2021 True
2022 True
Name: Week_Number, dtype: bool
Explanation
We consider each group (user, year)
. In that group, we observe an (ordered, without repeats) series of week numerals. This could be e.g. [1,2,3,12,13,18,19,20,21]
(a run of 3, a run of 2, and a run of 4). The Series dy
shows where there were gaps in the run (e.g. for the hypothetical value above: [T,F,F,T,F,T,F,F,F]
). We use the .cumsum()
of that to make groups that are each a consecutive run, e.g. [1,1,1,2,2,3,3,3,3]
. We take the size
of each group (e.g. [3,2,4]
), and return True
iff any of those is at least minimum_run
long.
Addendum: locate the weeks that meet the criteria
Here are some ideas, depending on how you'd like your output.
df = data.dropna().sort_values(['USERID', 'Year', 'Week_Number'])
df = df.assign(rungrp=(df.groupby(['USERID', 'Year'])['Week_Number'].diff() != 1).cumsum())
df = df.loc[df.groupby('rungrp')['rungrp'].transform('count') >= 3]
>>> df
USERID Week_Number Year rungrp
1 twitter 1.0 2021 3
2 twitter 2.0 2021 3
3 twitter 3.0 2021 3
4 twitter 1.0 2022 4
5 twitter 2.0 2022 4
6 twitter 3.0 2022 4
All the weeks that are part of a run of at least 3.
Grouping to find week min and max of each run:
>>> df.groupby(['USERID', 'Year', 'rungrp'])['Week_Number'].agg([min, max])
min max
USERID Year rungrp
twitter 2021 3 1.0 3.0
2022 4 1.0 3.0
CodePudding user response:
Instead of looping you can create a column which will show whether a user in a year has had a consecutive increase, and then check if that column sums to more than 3 per user in a year:
data.sort_values(by=['USERID','Year','Week_Number'],ascending=True,inplace=True)
data.assign(
grouped_increase = data.groupby([data.USERID, data.Year])["Week_Number"]
.diff()
.gt(0)
.astype(int)
).groupby([data.USERID, data.Year])["grouped_increase"].sum().reset_index().query(
"grouped_increase >= 3"
).drop(
"grouped_increase", axis=1
)
USERID Year
3 twitter 2022
Based on your comment, using this DF
:
USERID Week_Number Year
8 fb 2.0 2021.0
9 fb 3.0 2021.0
10 fb 4.0 2021.0
0 fb 5.0 2021.0
11 fb 2.0 2022.0
12 fb 3.0 2022.0
13 fb 4.0 2022.0
14 fb 5.0 2022.0
7 human 21.0 2022.0
1 twitter 1.0 2021.0
2 twitter 1.0 2022.0
3 twitter 2.0 2022.0
4 twitter 3.0 2022.0
5 twitter 15.0 2022.0
6 twitter NaN NaN
Running the above code gives:
USERID Year
0 fb 2021.0
1 fb 2022.0
4 twitter 2022.0