Suppose I have 2 data frames, df1 contains set of IDs and names and df2 contains a set of IDs.
How do I compare both data frames to produce a 3rd data frame with a set of IDs and names that were not found in df2?
CodePudding user response:
This is what we call an "anti-join" in SQL world: do a left join between the two frames, then filter out any records in the result where the joins matched.
The way I like to do it is:
- If
df_2
has only theID
column, make a copy of the dataframe and add another dummy column:df_2_temp = df_2.copy().assign(dummy=1)
- Do the left
merge()
(I'm assumingID
is not an index, else you would usejoin()
):df_3 = df_1.merge(df_2_temp, how="left", left_on="ID", right_on="ID")
- Now filter out the records where the join was successful - where there was a match between the sets - and clean up by getting rid of
dummy
:df_3 = df_3[df_3.dummy.isna()].drop(columns="dummy")
Update: it occurs to me that the indicator
parameter of merge()
(here) can obviate the need for the dummy
column. Am leaving answer as is so that it makes the steps explicit.
CodePudding user response:
try subtracting two sets with ids and then filter on the result creating a new dataframe using append
lstID=['ID_0001','ID_0002','ID_0003','ID_0004','ID_0005','ID_0006','ID_0007','ID_0008','ID_0009']
lstName=['Brenda','Chloe','Jim','Ron','Eva','Winnie','Marc','Levi','Kiril']
lstID2=['ID_0003','ID_0005','ID_0006','ID_0008']
df=pd.DataFrame({'ID':lstID,'Name':lstName})
df2=pd.DataFrame({'ID':lstID2})
set1=set(df['ID'])
set2=set(df2['ID'])
result=set1-set2
print(result)
df_result=pd.DataFrame(columns=df.columns)
for i in result:
df_result=df_result.append(df[df['ID']==i])
print(df_result.sort_values(by='ID'))
output:
ID Name
0 ID_0001 Brenda
1 ID_0002 Chloe
3 ID_0004 Ron
6 ID_0007 Marc
8 ID_0009 Kiril
CodePudding user response:
What you need is anti_join()
:
>>> from datar.all import f, tibble, anti_join, paste0, seq
>>> df1 = tibble(
>>> ID=paste0("ID_000", seq(9)),
>>> name=["Brenda", "Choloe", "Jim", "Ron", "Eva", "Winnie", "Marc", "Levi", "Kiril"]
>>> )
>>> df1
ID name
<object> <object>
0 ID_0001 Brenda
1 ID_0002 Choloe
2 ID_0003 Jim
3 ID_0004 Ron
4 ID_0005 Eva
5 ID_0006 Winnie
6 ID_0007 Marc
7 ID_0008 Levi
8 ID_0009 Kiril
>>> df2 = tibble(ID=paste0("ID_000", [3, 5, 6, 8]))
>>> anti_join(df1, df2)
ID name
<object> <object>
0 ID_0001 Brenda
1 ID_0002 Choloe
3 ID_0004 Ron
6 ID_0007 Marc
8 ID_0009 Kiril
I am the author of the datar
package, which is backed by pandas and porting dplyr and related packages from R to python.