I have the following dataframe
id date other variables..
A 2019Q4
A 2020Q4
A 2021Q4
B 2018Q4
B 2019Q4
B 2020Q4
B 2021Q4
C 2020Q4
C 2021Q4
D 2021Q4
E 2018Q4
E 2019Q4
E 2020Q4
E 2021Q4
. .
I want to group by id and keep those ids if it contains all of the designated values (i.e. 2019Q4, 2020Q4, 2021Q4
) then extract rows that correspond to those values. isin()
won't work because it won't drop C and D.
desired output
A 2019Q4
A 2020Q4
A 2021Q4
B 2019Q4
B 2020Q4
B 2021Q4
E 2019Q4
E 2020Q4
E 2021Q4
. .
CodePudding user response:
You can use set operations to filter the id and isin
for the date:
target = {'2019Q4', '2020Q4', '2021Q4'}
id_ok = df.groupby('id')['date'].agg(lambda x: target.issubset(x))
df2 = df[df['date'].isin(target) & df['id'].map(id_ok)]
or, using transform
:
target = {'2019Q4', '2020Q4', '2021Q4'}
mask = df.groupby('id')['date'].transform(lambda x: target.issubset(x))
df2 = df[df['date'].isin(target) & mask]
output:
id date other
0 A 2019Q4 NaN
1 A 2020Q4 NaN
2 A 2021Q4 NaN
4 B 2019Q4 NaN
5 B 2020Q4 NaN
6 B 2021Q4 NaN
11 E 2019Q4 NaN
12 E 2020Q4 NaN
13 E 2021Q4 NaN
id_ok
:
id
A True
B True
C False
D False
E True
Name: date, dtype: bool
CodePudding user response:
Aggregate the dates for each ID. Then keep only the IDs which have all your dates
ids = df.groupby("id")["date"].agg(set).apply(lambda x: x.issuperset({"2019Q4", "2020Q4", "2021Q4"}))
>>> df[df["id"].isin(ids.index.where(ids).dropna())&df["date"].isin(["2019Q4", "2020Q4", "2021Q4"])]
id date
0 A 2019Q4
1 A 2020Q4
2 A 2021Q4
4 B 2019Q4
5 B 2020Q4
6 B 2021Q4
11 E 2019Q4
12 E 2020Q4
13 E 2021Q4