I have the following dataframe:
df =
id medication
1 A
1 B
1 A
2 Z
2 A
2 A
3 B
3 D
3 A
I create two lists of medications:
ListA = ['A', 'Z']
ListB = ['B', 'C']
I want to obtain those individuals that have any of the medications from listA and any of the medications from listB so that the answer is:
dfOutput =
id medication
1 A
1 B
3 B
3 A
So far I am trying the following:
dfOutput = df.groupby("id").filter(lambda x : pd.Series([*ListA,*ListB]).isin(x['medication']).all())
CodePudding user response:
(df.groupby('id')['medication']
.agg(result = lambda x: x.isin(ListA).any() and x.isin(ListB).any())
.reset_index().merge(df)
.loc[lambda x: x['medication'].isin(ListA ListB)*x.result].drop_duplicates())
id result medication
0 1 True A
1 1 True B
6 3 True B
8 3 True A
CodePudding user response:
Solution without groupby.filter
for improve performance:
ListA = ['A', 'Z']
ListB = ['B', 'C']
#test matching both lists
m1 = df['medication'].isin(ListA)
m2 = df['medication'].isin(ListB)
#get intersection of matched id filtere by both conditions
both = np.intersect1d(df.loc[m1, 'id'],df.loc[m2, 'id'])
#filter id with both masks m1 or m2, last tremove duplicates
df = df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()
print (df)
id medication
0 1 A
1 1 B
6 3 B
8 3 A
Performance:
N = 10000
df = pd.DataFrame({'id':np.random.randint(N // 30, size=N),
'medication':np.random.choice(list('ABCDEFGHIJKL'), size=N)}).sort_values('id', ignore_index=True)
print (df)
ListA = ['A', 'Z']
ListB = ['B', 'C']
In [131]: %%timeit
...: m1 = df['medication'].isin(ListA)
...: m2 = df['medication'].isin(ListB)
...: #get intersection of matched id filtere by both conditions
...: both = np.intersect1d(df.loc[m1, 'id'],df.loc[m2, 'id'])
...:
...: #filter id with both masks m1 or m2, last tremove duplicates
...: df1 = df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()
...:
3.53 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [132]: %%timeit
...: (df.groupby('id')['medication']
...: .agg(result = lambda x: x.isin(ListA).any() and x.isin(ListB).any())
...: .reset_index().merge(df)
...: .loc[lambda x: x['medication'].isin(ListA ListB)*x.result].drop_duplicates())
...:
121 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)