Comparing 2 data frames and finding values are not in 2nd data frame-CodePudding

Suppose I have 2 data frames, df1 contains set of IDs and names and df2 contains a set of IDs.

How do I compare both data frames to produce a 3rd data frame with a set of IDs and names that were not found in df2?

CodePudding user response：

This is what we call an "anti-join" in SQL world: do a left join between the two frames, then filter out any records in the result where the joins matched.

The way I like to do it is:

If df_2 has only the ID column, make a copy of the dataframe and add another dummy column: df_2_temp = df_2.copy().assign(dummy=1)
Do the left merge() (I'm assuming ID is not an index, else you would use join()): df_3 = df_1.merge(df_2_temp, how="left", left_on="ID", right_on="ID")
Now filter out the records where the join was successful - where there was a match between the sets - and clean up by getting rid of dummy: df_3 = df_3[df_3.dummy.isna()].drop(columns="dummy")

Update: it occurs to me that the indicator parameter of merge() (here) can obviate the need for the dummy column. Am leaving answer as is so that it makes the steps explicit.

CodePudding user response：

try subtracting two sets with ids and then filter on the result creating a new dataframe using append

lstID=['ID_0001','ID_0002','ID_0003','ID_0004','ID_0005','ID_0006','ID_0007','ID_0008','ID_0009']
lstName=['Brenda','Chloe','Jim','Ron','Eva','Winnie','Marc','Levi','Kiril']
lstID2=['ID_0003','ID_0005','ID_0006','ID_0008']
df=pd.DataFrame({'ID':lstID,'Name':lstName})
df2=pd.DataFrame({'ID':lstID2})

set1=set(df['ID'])
set2=set(df2['ID'])
result=set1-set2
print(result)

df_result=pd.DataFrame(columns=df.columns)
for i in result:
    df_result=df_result.append(df[df['ID']==i])
        

print(df_result.sort_values(by='ID'))

output:

    ID    Name
0  ID_0001  Brenda
1  ID_0002   Chloe
3  ID_0004     Ron
6  ID_0007    Marc
8  ID_0009   Kiril

CodePudding user response：

What you need is anti_join():

>>> from datar.all import f, tibble, anti_join, paste0, seq
>>> df1 = tibble(
>>>   ID=paste0("ID_000", seq(9)), 
>>>   name=["Brenda", "Choloe", "Jim", "Ron", "Eva", "Winnie", "Marc", "Levi", "Kiril"]
>>> )
>>> df1
        ID     name
  <object> <object>
0  ID_0001   Brenda
1  ID_0002   Choloe
2  ID_0003      Jim
3  ID_0004      Ron
4  ID_0005      Eva
5  ID_0006   Winnie
6  ID_0007     Marc
7  ID_0008     Levi
8  ID_0009    Kiril
>>> df2 = tibble(ID=paste0("ID_000", [3, 5, 6, 8]))
>>> anti_join(df1, df2)
        ID     name
  <object> <object>
0  ID_0001   Brenda
1  ID_0002   Choloe
3  ID_0004      Ron
6  ID_0007     Marc
8  ID_0009    Kiril

I am the author of the datar package, which is backed by pandas and porting dplyr and related packages from R to python.