Not exactly sure how to ask the question but I can show a clear before and after that would help.
My data looks like this:
temp_df = pd.DataFrame({'userID': [1, 3, 4, 3, 4, 3],
'Year': [2020, 2020, 2021, 2021, 2022, 2022]})
temp_df
userID | Year |
---|---|
1 | 2020 |
3 | 2020 |
4 | 2021 |
3 | 2021 |
4 | 2022 |
3 | 2022 |
I want to create a rolling count of userIDs so I can see which userIDs are still around and for how long. So it should end up being something like this:
Year | Number of ids from last year | Number of ids from 2 years ago |
---|---|---|
2020 | 0 | 0 |
2021 | 1 | 0 |
2022 | 2 | 1 |
and so on and so forth.
I don't know how to go about solving this so I thought it better to ask.
CodePudding user response:
What you want differs from what you asked, so I try to address what you want first:
df_analyze = temp_df.groupby(['userID']).Year.agg(['min', 'max'])
df_analyze['duration'] = df_analyze['max']-df_analyze['min']
print(df_analyze['duration'])
userID
1 0
3 2
4 1
You can group by user id and get the min and max year of a userid, the difference is how long the user stayed.
CodePudding user response:
Use sets to keep track of the seen IDs:
s = temp_df.groupby('Year')['userID'].agg(set)
N = 2
df = (pd.DataFrame({f'n-{i}': [len(a&b) for a,b in zip(s, s.shift(i, fill_value=set()))]
for i in range(1, N 1)}, index=s.index)
.reset_index()
)
print(df)
Output:
Year n-1 n-2
0 2020 0 0
1 2021 1 0
2 2022 2 1