Make a dataframe containing rows that were not matched after merging df1 and df2-CodePudding

I have df1 with around 3,67,000 rows. df2 has 30k rows. Their common columns are first_name, middle_name and last_name, where first name and last name are exact matches, and middle_name has some constraints.

The matched df has 20k rows. I want to make a dataframe containing df2-matched (30k-20k= 10k rows).

Essentially, I want to find the rows in df2 that were not a match to any rows in df1, but I cannot concat or merge because the columns are different.

CodePudding user response：

new_df = df2[~df2.index.isin(matched.index)]

Explanation: You are saying "keep only the rows in df2 that are not in the matched data frame, and save this as a new dataframe"

CodePudding user response：

Use the isin() method to compare the rows in df2 to df1 and filter out the ones that are not a match.

Create a boolean mask for the rows in df2 that match df1

Example:

mask=df2[['first_name','middle_name','last_name']].isin(df1[['first_name','middle_name','last_name']].to_dict('records')).all(1)

Use the mask to filter out the non-matching rows

non_matching_rows = df2[~mask]

This will filter out the rows in df2 that were not a match to any rows in df1. The resulting dataframe non_matching_rows will contain only the rows that did not match in the df1.

CodePudding user response：

I think what you're looking for is the difference of the two dataframes. I think a suitable solution would be concatenation without duplicates.

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)

Sample solution:

import pandas as pd

df1 = pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green'],
    })

df2 = pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange'],
    })


df1=
       Date      Fruit  Num      Color
0   2013-11-24  Banana  22.1     Yellow
1   2013-11-24  Orange  8.6      Orange
2   2013-11-24  Apple   7.6      Green
3   2013-11-24  Celery  10.2     Green

df2=
        Date    Fruit    Num    Color
0   2013-11-24  Banana  22.1    Yellow
1   2013-11-24  Orange  8.6     Orange
2   2013-11-24  Apple   7.6     Green
3   2013-11-24  Celery  10.2    Green
4   2013-11-25  Apple   22.1    Red
5   2013-11-25  Orange  8.6     Orange

df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
df_diff=
         Date   Fruit   Num    Color
4   2013-11-25  Apple   22.1    Red
5   2013-11-25  Orange  8.6    Orange