I have a dataset consisting of directorid, match_id, and calyear. I would like to keep only observations by director_id and match_id that have at least 2 consecutive years of data. I have tried a few different ways to do this, and haven't been able to get it quite right. The few different things I have tried have also required multiple steps and weren't particularly clean.
Here is what I have:
director_id | match_id | calyear |
---|---|---|
282 | 1111 | 2006 |
282 | 1111 | 2007 |
356 | 2222 | 2005 |
356 | 2222 | 2007 |
600 | 3333 | 2010 |
600 | 3333 | 2011 |
600 | 3333 | 2012 |
600 | 3355 | 2013 |
600 | 3355 | 2015 |
600 | 3355 | 2016 |
753 | 4444 | 2005 |
753 | 4444 | 2008 |
753 | 4444 | 2009 |
Here is what I want:
director_id | match_id | calyear |
---|---|---|
282 | 1111 | 2006 |
282 | 1111 | 2007 |
600 | 3333 | 2010 |
600 | 3333 | 2011 |
600 | 3333 | 2012 |
600 | 3355 | 2015 |
600 | 3355 | 2016 |
753 | 4444 | 2008 |
753 | 4444 | 2009 |
I started by creating a variable equal to one:
df['tosum'] = 1
And then count the number of observations where the difference in calyear by group is equal to 1.
df['num_years'] = (
df.groupby(['directorid','match_id'])['tosum'].transform('sum').where(df.groupby(['match_id'])['calyear'].diff()==1, np.nan)
)
And then I keep all observations with 'num_years' greater than 1.
However, the first observation per director_id match_id gets set equal to NaN. In general, I think I am going about this in a convoluted way...it feels like there should be a simpler way to achieve my goal. Any help is greatly appreciated!
CodePudding user response:
Yes you need to groupby 'director_id', 'match_id' and then do a transform but the transform just needs to look at the difference between next element in both directions. In one direction you need to see if it equals 1 and in another -1 and then subset using the resulting True/False values.
df = df[
df.groupby(["director_id", "match_id"])["calyear"].transform(
lambda x: (x.diff().eq(1)) | (x[::-1].diff().eq(-1))
)
]
print(df):
director_id match_id calyear
0 282 1111 2006
1 282 1111 2007
4 600 3333 2010
5 600 3333 2011
6 600 3333 2012
8 600 3355 2015
9 600 3355 2016
11 753 4444 2008
12 753 4444 2009