Python: Finding individuals with conditions from two lists using Python (pandas)-CodePudding

I have the following dataframe:

df = 
id medication
1  A
1  B
1  A
2  Z
2  A
2  A
3  B
3  D 
3  A

I create two lists of medications:

ListA = ['A', 'Z']
ListB = ['B', 'C']

I want to obtain those individuals that have any of the medications from listA and any of the medications from listB so that the answer is:

dfOutput = 
id medication
1  A
1  B
3  B
3  A

So far I am trying the following:

dfOutput = df.groupby("id").filter(lambda x : pd.Series([*ListA,*ListB]).isin(x['medication']).all())

CodePudding user response：

(df.groupby('id')['medication']
  .agg(result = lambda x: x.isin(ListA).any() and x.isin(ListB).any())
  .reset_index().merge(df)
  .loc[lambda x: x['medication'].isin(ListA ListB)*x.result].drop_duplicates())

   id  result medication
0   1    True          A
1   1    True          B
6   3    True          B
8   3    True          A

CodePudding user response：

Solution without groupby.filter for improve performance:

ListA = ['A', 'Z']
ListB = ['B', 'C']

#test matching both lists
m1 = df['medication'].isin(ListA)
m2 = df['medication'].isin(ListB)
#get intersection of matched id filtere by both conditions
both = np.intersect1d(df.loc[m1, 'id'],df.loc[m2, 'id'])

#filter id with both masks m1 or m2, last tremove duplicates
df = df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()
print (df)
   id medication
0   1          A
1   1          B
6   3          B
8   3          A

Performance:

N = 10000
df = pd.DataFrame({'id':np.random.randint(N // 30, size=N),
                   'medication':np.random.choice(list('ABCDEFGHIJKL'), size=N)}).sort_values('id', ignore_index=True)
print (df)

ListA = ['A', 'Z']
ListB = ['B', 'C']


In [131]: %%timeit
     ...: m1 = df['medication'].isin(ListA)
     ...: m2 = df['medication'].isin(ListB)
     ...: #get intersection of matched id filtere by both conditions
     ...: both = np.intersect1d(df.loc[m1, 'id'],df.loc[m2, 'id'])
     ...: 
     ...: #filter id with both masks m1 or m2, last tremove duplicates
     ...: df1 = df[df['id'].isin(both) & (m1 | m2)].drop_duplicates()
     ...: 
3.53 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [132]: %%timeit
     ...: (df.groupby('id')['medication']
     ...:   .agg(result = lambda x: x.isin(ListA).any() and x.isin(ListB).any())
     ...:   .reset_index().merge(df)
     ...:   .loc[lambda x: x['medication'].isin(ListA ListB)*x.result].drop_duplicates())
     ...:   

121 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)