I have df1 with around 3,67,000 rows. df2 has 30k rows. Their common columns are first_name, middle_name and last_name, where first name and last name are exact matches, and middle_name has some constraints.
The matched df has 20k rows. I want to make a dataframe containing df2-matched (30k-20k= 10k rows).
Essentially, I want to find the rows in df2 that were not a match to any rows in df1, but I cannot concat or merge because the columns are different.
CodePudding user response:
new_df = df2[~df2.index.isin(matched.index)]
Explanation: You are saying "keep only the rows in df2 that are not in the matched data frame, and save this as a new dataframe"
CodePudding user response:
Use the isin() method to compare the rows in df2 to df1 and filter out the ones that are not a match.
Create a boolean mask for the rows in df2 that match df1
Example:
mask=df2[['first_name','middle_name','last_name']].isin(df1[['first_name','middle_name','last_name']].to_dict('records')).all(1)
Use the mask to filter out the non-matching rows
non_matching_rows = df2[~mask]
This will filter out the rows in df2 that were not a match to any rows in df1. The resulting dataframe non_matching_rows will contain only the rows that did not match in the df1.
CodePudding user response:
I think what you're looking for is the difference of the two dataframes. I think a suitable solution would be concatenation without duplicates.
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
Sample solution:
import pandas as pd
df1 = pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
'Fruit':['Banana','Orange','Apple','Celery'],
'Num':[22.1,8.6,7.6,10.2],
'Color':['Yellow','Orange','Green','Green'],
})
df2 = pd.DataFrame({
'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})
df1=
Date Fruit Num Color
0 2013-11-24 Banana 22.1 Yellow
1 2013-11-24 Orange 8.6 Orange
2 2013-11-24 Apple 7.6 Green
3 2013-11-24 Celery 10.2 Green
df2=
Date Fruit Num Color
0 2013-11-24 Banana 22.1 Yellow
1 2013-11-24 Orange 8.6 Orange
2 2013-11-24 Apple 7.6 Green
3 2013-11-24 Celery 10.2 Green
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
df_diff = pd.concat([df1,df2]).drop_duplicates(keep=False)
df_diff=
Date Fruit Num Color
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange