Is there a way in pandas to count the number of duplicates across years?-CodePudding

Not exactly sure how to ask the question but I can show a clear before and after that would help.

My data looks like this:

temp_df = pd.DataFrame({'userID': [1, 3, 4, 3, 4, 3], 
                        'Year': [2020, 2020, 2021, 2021, 2022, 2022]})
temp_df

userID	Year
1	2020
3	2020
4	2021
3	2021
4	2022
3	2022

I want to create a rolling count of userIDs so I can see which userIDs are still around and for how long. So it should end up being something like this:

Year	Number of ids from last year	Number of ids from 2 years ago
2020	0	0
2021	1	0
2022	2	1

and so on and so forth.

I don't know how to go about solving this so I thought it better to ask.

CodePudding user response：

What you want differs from what you asked, so I try to address what you want first:

df_analyze = temp_df.groupby(['userID']).Year.agg(['min', 'max'])
df_analyze['duration'] = df_analyze['max']-df_analyze['min']
print(df_analyze['duration'])

userID
1    0
3    2
4    1

You can group by user id and get the min and max year of a userid, the difference is how long the user stayed.

CodePudding user response：

Use sets to keep track of the seen IDs:

s = temp_df.groupby('Year')['userID'].agg(set)

N = 2
df = (pd.DataFrame({f'n-{i}': [len(a&b) for a,b in zip(s, s.shift(i, fill_value=set()))]
                    for i in range(1, N 1)}, index=s.index)
        .reset_index()
      )

print(df)

Output:

   Year  n-1  n-2
0  2020    0    0
1  2021    1    0
2  2022    2    1