Home > Software engineering >  How to extract ids and rows only if a columns has all of the designated values
How to extract ids and rows only if a columns has all of the designated values

Time:05-02

I have the following dataframe

id  date     other variables..  
A   2019Q4      
A   2020Q4        
A   2021Q4 
B   2018Q4
B   2019Q4
B   2020Q4
B   2021Q4
C   2020Q4
C   2021Q4
D   2021Q4
E   2018Q4
E   2019Q4
E   2020Q4
E   2021Q4
.       .      

I want to group by id and keep those ids if it contains all of the designated values (i.e. 2019Q4, 2020Q4, 2021Q4) then extract rows that correspond to those values. isin() won't work because it won't drop C and D.

desired output

A   2019Q4      
A   2020Q4        
A   2021Q4 
B   2019Q4
B   2020Q4
B   2021Q4
E   2019Q4
E   2020Q4
E   2021Q4
.       .      

CodePudding user response:

You can use set operations to filter the id and isin for the date:

target = {'2019Q4', '2020Q4', '2021Q4'}

id_ok = df.groupby('id')['date'].agg(lambda x: target.issubset(x))

df2 = df[df['date'].isin(target) & df['id'].map(id_ok)]

or, using transform:

target = {'2019Q4', '2020Q4', '2021Q4'}

mask = df.groupby('id')['date'].transform(lambda x: target.issubset(x))

df2 = df[df['date'].isin(target) & mask]

output:

   id    date  other
0   A  2019Q4    NaN
1   A  2020Q4    NaN
2   A  2021Q4    NaN
4   B  2019Q4    NaN
5   B  2020Q4    NaN
6   B  2021Q4    NaN
11  E  2019Q4    NaN
12  E  2020Q4    NaN
13  E  2021Q4    NaN

id_ok:

id
A     True
B     True
C    False
D    False
E     True
Name: date, dtype: bool

CodePudding user response:

Aggregate the dates for each ID. Then keep only the IDs which have all your dates

ids = df.groupby("id")["date"].agg(set).apply(lambda x: x.issuperset({"2019Q4", "2020Q4", "2021Q4"}))

>>> df[df["id"].isin(ids.index.where(ids).dropna())&df["date"].isin(["2019Q4", "2020Q4", "2021Q4"])]

   id    date
0   A  2019Q4
1   A  2020Q4
2   A  2021Q4
4   B  2019Q4
5   B  2020Q4
6   B  2021Q4
11  E  2019Q4
12  E  2020Q4
13  E  2021Q4
  • Related