Home > Net >  Find the difference between two columns in a dataframe but keeping the row index avaiable
Find the difference between two columns in a dataframe but keeping the row index avaiable

Time:11-23

I have two dataframes, df1 = pd.dataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']}) df2 = pd.dataframe({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key"[1 2 3 4 5 6]})

I want to use something like a set(df2).difference(df1) to find the difference between the product columns but I want to keep the indexes. So ideally the output would look like this

result =['aples', 'appples', 'banas', 'kiwki'][2 3 4 6]

Whenever I use the set.difference() I get the list of the different values but I lose the key index.

CodePudding user response:

I guess you are trying to do a left anti join, which means you only want to keep the rows in df2 that aren't present in df1. In that case:

df1 = pd.DataFrame({"product":['apples', 'bananas', 'oranges', 'kiwi']})
df2 = pd.DataFrame({"product":['apples', 'aples', 'appples', 'banans', 'oranges', 'kiwki'], "key":[1, 2, 3, 4, 5, 6]})

# left join
joined_df = df2.merge(df1, on='product', how='left', indicator=True)
# keeping products that were only present in df2
products_only_in_df2 = joined_df.loc[joined_df['_merge'] == 'left_only', 'product']
# filtering df2 using the above df so we have the keys as well
result = df2[df2['product'].isin(products_only_in_df2)]

CodePudding user response:

You have to filter the df2 frame checking if the elements from df2 are not in df1:

df2[~df2["product"].isin(df1['product'])]
  • ~ negates the values of a boolean Series.
  • ser1.isin(ser2) is a boolean Series which gives, for each element of ser 1, whether or not the value can be found in ser2.
  • Related