I have two dataframes like as below
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
I would like to do the below
a) Check whether the ID
and Name
from df1
is present in df2
.
b) If present in df2
, put Yes
in Status column or No
in Status column. Don't use ~
or not in
operator because my df2 has million of rows. So, it will result in irrelevant results
I tried the below
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives matching ids and names between df1
and df2
.
But I want to find the ids
and names
that are present in df1
but missing/absent in df2
. I cannot use the ~
operator because it will return all the rows from df2
that don't have a match in df1
. In real world, my df2 has millions of rows. I only want to find the missing df1 ids and names and put a status column
I expect my output to be like as below
ID,Name,Sub,Country, Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
CodePudding user response:
Expected ouput is for match by 3 columns:
m = df1.merge(df2,
left_on=['ID','Name','Sub'],
right_on=['ID','Name','class'],
indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
With testing by isin
solution is:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Because if match by 2 columns ouput is different:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No