Keep observations with two or more consecutive years of data by group-CodePudding

I have a dataset consisting of directorid, match_id, and calyear. I would like to keep only observations by director_id and match_id that have at least 2 consecutive years of data. I have tried a few different ways to do this, and haven't been able to get it quite right. The few different things I have tried have also required multiple steps and weren't particularly clean.

Here is what I have:

director_id	match_id	calyear
282	1111	2006
282	1111	2007
356	2222	2005
356	2222	2007
600	3333	2010
600	3333	2011
600	3333	2012
600	3355	2013
600	3355	2015
600	3355	2016
753	4444	2005
753	4444	2008
753	4444	2009

Here is what I want:

director_id	match_id	calyear
282	1111	2006
282	1111	2007
600	3333	2010
600	3333	2011
600	3333	2012
600	3355	2015
600	3355	2016
753	4444	2008
753	4444	2009

I started by creating a variable equal to one:

df['tosum'] = 1

And then count the number of observations where the difference in calyear by group is equal to 1.

df['num_years'] = (
df.groupby(['directorid','match_id'])['tosum'].transform('sum').where(df.groupby(['match_id'])['calyear'].diff()==1, np.nan)
                             )

And then I keep all observations with 'num_years' greater than 1.

However, the first observation per director_id match_id gets set equal to NaN. In general, I think I am going about this in a convoluted way...it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!

CodePudding user response：

Yes you need to groupby 'director_id', 'match_id' and then do a transform but the transform just needs to look at the difference between next element in both directions. In one direction you need to see if it equals 1 and in another -1 and then subset using the resulting True/False values.

df = df[
    df.groupby(["director_id", "match_id"])["calyear"].transform(
        lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
    )
]

print(df):

    director_id  match_id  calyear
0           282      1111     2006
1           282      1111     2007
4           600      3333     2010
5           600      3333     2011
6           600      3333     2012
8           600      3355     2015
9           600      3355     2016
11          753      4444     2008
12          753      4444     2009