I have a similar question to Comparing two data frames and getting the differences, but the columns in df1 are a subset of the columns in df2.
df1:
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
df2:
Date Fruit Num Color A
2013-11-24 Banana 22.1 Yellow 1
2013-11-24 Orange 8.6 Orange 2
2013-11-24 Apple 7.6 Green 3
2013-11-24 Celery 10.2 Green 4
2013-11-25 Apple 22.1 Red 5
2013-11-25 Orange 8.6 Orange 6
I would like to get the difference between the two dataframes by comparing only the columns they have in common. The result I expect to get is:
Date Fruit Num Color A
4 2013-11-25 Apple 22.1 Red 5
5 2013-11-25 Orange 8.6 Orange 6
Is there a way to do so? Any help is appreciated.
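For reference, the sample frames can be built like this (Date is kept as a plain string here; the actual dtypes may differ):
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})

df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
    "A": [1, 2, 3, 4, 5, 6],
})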
CodePudding user response:
If you don't have duplicates in df1 or df2[df1.columns], you could use .drop_duplicates with keep=False:
res = pd.concat([df1, df2[df1.columns]]).drop_duplicates(keep=False)
Result for your sample dataframes:
Date Fruit Num Color
4 2013-11-25 Apple 22.1 Red
5 2013-11-25 Orange 8.6 Orange
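Since the index labels of the differing rows are preserved (4 and 5 here come from df2), you can also get back the full df2 rows, including the extra A column shown in the expected output. A small sketch, assuming (as in the sample data) that every row of df1 also occurs in df2 and that df2 has a unique index:
# the surviving index labels all come from df2, so .loc recovers the full rows (including A)
res_full = df2.loc[res.index]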
PS: As far as I can see, the other answer also covers only the non-duplicate case.
If you do have duplicates in df1 or df2[df1.columns] and want to preserve them in the result, you could try:
res = pd.concat(
    [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()]
).drop_duplicates(keep=False).merge(
    pd.concat([df1, df2[df1.columns]]), on=list(df1.columns), how="inner"
)
- First drop the duplicates within each original, then drop the duplicates on the concatenated result: this gives the rows that are not present in both frames.
- Then recover the right number of rows by an inner merge of that result with the raw concatenation of the originals, as sketched below.
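The same logic written out in two steps (the intermediate name sym_diff is only for illustration):
# step 1: rows that are not in both frames, ignoring duplicates
sym_diff = pd.concat(
    [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()]
).drop_duplicates(keep=False)

# step 2: inner merge with the raw concatenation to restore the original multiplicity
res = sym_diff.merge(
    pd.concat([df1, df2[df1.columns]]), on=list(df1.columns), how="inner"
)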
For example, if df2 looked like
Date Fruit Num Color A
2013-11-24 Banana 22.1 Yellow 1
2013-11-24 Orange 8.6 Orange 2
2013-11-24 Apple 7.6 Green 3
2013-11-24 Celery 10.2 Green 4
2013-11-25 Apple 22.1 Red 5
2013-11-25 Apple 22.1 Red 5
2013-11-25 Orange 8.6 Orange 6
then the result would be
Date Fruit Num Color
0 2013-11-25 Apple 22.1 Red
1 2013-11-25 Apple 22.1 Red
2 2013-11-25 Orange 8.6 Orange
If you don't want to preserve the duplicates in the result, then just do:
res = pd.concat(
    [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()]
).drop_duplicates(keep=False)
CodePudding user response:
First, get the column names of df1:
df1_columns = df1.columns # ["Date", "Fruit", "Num", "Color"]
Now create a filtered version of df2 with only df1's columns:
df2_filtered = df2[df1_columns]
And now you can apply the solution from this other question.
# concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)

# group by all columns
df_gpby = df.groupby(list(df.columns))

# get the index of records that appear only once
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

# filter: keep only the rows that are not in both dataframes
df.reindex(idx)
Hope it helps!