Home > Net >  Is there a way in pandas to count the number of duplicates across years?
Is there a way in pandas to count the number of duplicates across years?

Time:01-24

Not exactly sure how to ask the question but I can show a clear before and after that would help.

My data looks like this:

temp_df = pd.DataFrame({'userID': [1, 3, 4, 3, 4, 3], 
                        'Year': [2020, 2020, 2021, 2021, 2022, 2022]})
temp_df
userID Year
1 2020
3 2020
4 2021
3 2021
4 2022
3 2022

I want to create a rolling count of userIDs so I can see which userIDs are still around and for how long. So it should end up being something like this:

Year Number of ids from last year Number of ids from 2 years ago
2020 0 0
2021 1 0
2022 2 1

and so on and so forth.

I don't know how to go about solving this so I thought it better to ask.

CodePudding user response:

What you want differs from what you asked, so I try to address what you want first:

df_analyze = temp_df.groupby(['userID']).Year.agg(['min', 'max'])
df_analyze['duration'] = df_analyze['max']-df_analyze['min']
print(df_analyze['duration'])

userID
1    0
3    2
4    1

You can group by user id and get the min and max year of a userid, the difference is how long the user stayed.

CodePudding user response:

Use sets to keep track of the seen IDs:

s = temp_df.groupby('Year')['userID'].agg(set)

N = 2
df = (pd.DataFrame({f'n-{i}': [len(a&b) for a,b in zip(s, s.shift(i, fill_value=set()))]
                    for i in range(1, N 1)}, index=s.index)
        .reset_index()
      )

print(df)

Output:

   Year  n-1  n-2
0  2020    0    0
1  2021    1    0
2  2022    2    1
  • Related