Filter rows with more than X entries per year

Time:01-16

I have a df with IDs and Dates. Here is an example:

ID  Date
1   26.04.2011
1   21.10.2011
14  25.02.2010
14  08.07.2010
14  20.10.2010
14  07.01.2011
14  20.04.2011
14  02.07.2011
14  11.10.2011
14  23.01.2012
14  19.04.2012
14  22.10.2012
14  15.01.2013
14  06.05.2013
18  23.11.2012
18  05.06.2013
18  19.08.2013
18  11.04.2014
18  18.07.2014

ID            object
Date     datetime64[ns]

I want to keep only those rows where there are at least 3 (>= 3) Dates per year per ID. So the result should be:

ID  Date
14  25.02.2010
14  08.07.2010
14  20.10.2010
14  07.01.2011
14  20.04.2011
14  02.07.2011
14  11.10.2011
14  23.01.2012
14  19.04.2012
14  22.10.2012

I tried groupby and size:

            ID  year  size
0            1  2011     2
1           14  2010     3
2           14  2011     4

However, this is not what I want.
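
The exact call behind the table above is not shown. As a minimal sketch, assuming a hypothetical frame named `sample` that holds only the first nine rows of the example, a call of this shape would produce the same counts. It also shows why the attempt falls short: `size()` aggregates to one row per (ID, year) group, so the original rows are no longer available to filter.

```python
import pandas as pd

# Hypothetical sample: the first nine rows of the example, with ID as strings (object dtype)
sample = pd.DataFrame({
    "ID": ["1", "1", "14", "14", "14", "14", "14", "14", "14"],
    "Date": pd.to_datetime(
        ["26.04.2011", "21.10.2011", "25.02.2010", "08.07.2010", "20.10.2010",
         "07.01.2011", "20.04.2011", "02.07.2011", "11.10.2011"],
        format="%d.%m.%Y",
    ),
})

# size() aggregates: the result has one row per (ID, year) group, not one per input row
counts = (
    sample.groupby(["ID", sample["Date"].dt.year.rename("year")])
    .size()
    .reset_index(name="size")
)
print(counts)
```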

CodePudding user response:

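Use GroupBy.transform with 'size' to count the rows per ID and year. Unlike GroupBy.size, it returns one count per original row, so the result can be compared with Series.ge (greater than or equal) and used directly for boolean indexing:

import pandas as pd

# Parse day-first strings such as 26.04.2011 into datetimes
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Keep rows whose (ID, year) group contains at least 3 rows
df = df[df.groupby(['ID', df['Date'].dt.year])['ID'].transform('size').ge(3)]
print(df)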
    ID       Date
2   14 2010-02-25
3   14 2010-07-08
4   14 2010-10-20
5   14 2011-01-07
6   14 2011-04-20
7   14 2011-07-02
8   14 2011-10-11
9   14 2012-01-23
10  14 2012-04-19
11  14 2012-10-22
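
Since the title asks about a general threshold X, the same idea can be wrapped in a small helper. This is only a sketch: the function name `keep_frequent_years` and its parameters are hypothetical, and the frame is rebuilt from the question's example so the block runs on its own.

```python
import pandas as pd

def keep_frequent_years(frame, min_per_year=3, id_col="ID", date_col="Date"):
    """Keep rows whose (ID, calendar year) group has at least min_per_year rows."""
    years = frame[date_col].dt.year
    # transform('size') returns one count per original row, aligned on the index
    group_sizes = frame.groupby([id_col, years])[id_col].transform("size")
    return frame[group_sizes >= min_per_year]

# Example data copied from the question (ID kept as strings, i.e. object dtype)
ids = ["1"] * 2 + ["14"] * 12 + ["18"] * 5
dates = [
    "26.04.2011", "21.10.2011",
    "25.02.2010", "08.07.2010", "20.10.2010", "07.01.2011", "20.04.2011",
    "02.07.2011", "11.10.2011", "23.01.2012", "19.04.2012", "22.10.2012",
    "15.01.2013", "06.05.2013",
    "23.11.2012", "05.06.2013", "19.08.2013", "11.04.2014", "18.07.2014",
]
example = pd.DataFrame({"ID": ids, "Date": pd.to_datetime(dates, format="%d.%m.%Y")})

result = keep_frequent_years(example, min_per_year=3)
print(result)
```

With min_per_year=3 this keeps the same ten rows for ID 14 (2010 to 2012) as the output above; raising it to 4 would keep only the four 2011 rows for ID 14.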