Home > Mobile >  Pandas filter without ~ and not in operator
Pandas filter without ~ and not in operator

Time:06-10

I have two dataframes like as below

ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND

df1 = pd.read_clipboard(sep=',')

ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26

df2 = pd.read_clipboard(sep=',')

I would like to do the below

a) Check whether the ID and Name from df1 is present in df2.

b) If present in df2, put Yes in Status column or No in Status column. Don't use ~ or not in operator because my df2 has million of rows. So, it will result in irrelevant results

I tried the below

ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()

The above code gives matching ids and names between df1 and df2.

But I want to find the ids and names that are present in df1 but missing/absent in df2. I cannot use the ~ operator because it will return all the rows from df2 that don't have a match in df1. In real world, my df2 has millions of rows. I only want to find the missing df1 ids and names and put a status column

I expect my output to be like as below

ID,Name,Sub,Country, Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No

CodePudding user response:

Expected ouput is for match by 3 columns:

m = df1.merge(df2, 
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'], 
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
   ID Name      Sub Country Status
0   1  ABC      ENG      UK     No
1   1  ABC    MATHS      UK     No
2   1  ABC  Science      UK    Yes
3   2  ABE      ENG     USA     No
4   2  ABE    MATHS     USA     No
5   2  ABE  Science     USA     No
6   3  ABF      ENG     IND     No
7   3  ABF    MATHS     IND     No
8   3  ABF  Science     IND     No

With testing by isin solution is:

idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']]) 
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
   ID Name      Sub Country Status
0   1  ABC      ENG      UK     No
1   1  ABC    MATHS      UK     No
2   1  ABC  Science      UK    Yes
3   2  ABE      ENG     USA     No
4   2  ABE    MATHS     USA     No
5   2  ABE  Science     USA     No
6   3  ABF      ENG     IND     No
7   3  ABF    MATHS     IND     No
8   3  ABF  Science     IND     No

Because if match by 2 columns ouput is different:

m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
   ID Name      Sub Country Status
0   1  ABC      ENG      UK    Yes
1   1  ABC    MATHS      UK    Yes
2   1  ABC  Science      UK    Yes
3   2  ABE      ENG     USA     No
4   2  ABE    MATHS     USA     No
5   2  ABE  Science     USA     No
6   3  ABF      ENG     IND     No
7   3  ABF    MATHS     IND     No
8   3  ABF  Science     IND     No
  • Related