Home > OS >  Keep observations with two or more consecutive years of data by group
Keep observations with two or more consecutive years of data by group

Time:12-03

I have a dataset consisting of directorid, match_id, and calyear. I would like to keep only observations by director_id and match_id that have at least 2 consecutive years of data. I have tried a few different ways to do this, and haven't been able to get it quite right. The few different things I have tried have also required multiple steps and weren't particularly clean.

Here is what I have:

director_id match_id calyear
282 1111 2006
282 1111 2007
356 2222 2005
356 2222 2007
600 3333 2010
600 3333 2011
600 3333 2012
600 3355 2013
600 3355 2015
600 3355 2016
753 4444 2005
753 4444 2008
753 4444 2009

Here is what I want:

director_id match_id calyear
282 1111 2006
282 1111 2007
600 3333 2010
600 3333 2011
600 3333 2012
600 3355 2015
600 3355 2016
753 4444 2008
753 4444 2009

I started by creating a variable equal to one:

df['tosum'] = 1

And then count the number of observations where the difference in calyear by group is equal to 1.

df['num_years'] = (
df.groupby(['directorid','match_id'])['tosum'].transform('sum').where(df.groupby(['match_id'])['calyear'].diff()==1, np.nan)
                             )

And then I keep all observations with 'num_years' greater than 1.

However, the first observation per director_id match_id gets set equal to NaN. In general, I think I am going about this in a convoluted way...it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!

CodePudding user response:

Yes you need to groupby 'director_id', 'match_id' and then do a transform but the transform just needs to look at the difference between next element in both directions. In one direction you need to see if it equals 1 and in another -1 and then subset using the resulting True/False values.

df = df[
    df.groupby(["director_id", "match_id"])["calyear"].transform(
        lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
    )
]

print(df):

    director_id  match_id  calyear
0           282      1111     2006
1           282      1111     2007
4           600      3333     2010
5           600      3333     2011
6           600      3333     2012
8           600      3355     2015
9           600      3355     2016
11          753      4444     2008
12          753      4444     2009
  • Related