Comparing two data frames with different columns and getting the differences

I have a question similar to "Comparing two data frames and getting the differences", but the columns in df1 are a subset of the columns in df2.

df1:
Date       Fruit  Num  Color 
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange  8.6 Orange
2013-11-24 Apple   7.6 Green
2013-11-24 Celery 10.2 Green

df2:
Date       Fruit  Num  Color  A
2013-11-24 Banana 22.1 Yellow 1 
2013-11-24 Orange  8.6 Orange 2
2013-11-24 Apple   7.6 Green  3 
2013-11-24 Celery 10.2 Green  4
2013-11-25 Apple  22.1 Red    5
2013-11-25 Orange  8.6 Orange 6

I would like to get the difference between the two data frames by comparing only the columns they have in common. The result I expect is:

         Date   Fruit   Num   Color A
4  2013-11-25   Apple  22.1     Red 5
5  2013-11-25  Orange   8.6  Orange 6

Is there a way to do so? Any help is appreciated.

CodePudding user response：

If neither df1 nor df2[df1.columns] contains duplicate rows, you could concatenate them and use .drop_duplicates with keep=False, which drops every row that occurs more than once:

res = pd.concat([df1, df2[df1.columns]]).drop_duplicates(keep=False)

Result for your sample dataframes:

         Date   Fruit   Num   Color
4  2013-11-25   Apple  22.1     Red
5  2013-11-25  Orange   8.6  Orange

PS: As far as I can see the other answer also covers only the non-duplicate case.
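The result above contains only the common columns, while the expected output in the question also shows column A. One way to keep it is a left merge with indicator=True, sketched below on the sample data. This is only a sketch: it returns the rows of df2 that are missing from df1 (not the difference in both directions), and the names common and merged are just illustrative.

```python
import pandas as pd

# Sample frames rebuilt from the question.
df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
    "A": [1, 2, 3, 4, 5, 6],
})

common = list(df1.columns)  # columns shared by both frames

# Left merge on the common columns; the "_merge" column added by
# indicator=True says whether each df2 row found a match in df1.
# df1 is de-duplicated first so that no df2 row gets multiplied.
merged = df2.merge(df1.drop_duplicates(), on=common, how="left", indicator=True)

# Keep the rows that exist only in df2, then drop the helper column.
res = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(res)
```

On the sample data this keeps the two 2013-11-25 rows together with their A values.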

If you do have duplicates in df1 or df2[df1.columns] and want to preserve them in the result, you could try:

res = pd.concat(
    [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()]
).drop_duplicates(keep=False).merge(
    pd.concat([df1, df2[df1.columns]]), on=list(df1.columns), how="inner"
)

First drop the duplicates within each of the originals, then drop all repeated rows (keep=False) from the concatenation of the two: this leaves the rows that are not in both data frames.
Then an inner merge of that result with the concatenated originals fetches each remaining row as many times as it occurs in the originals.

For example, if df2 looked like

Date       Fruit  Num  Color  A
2013-11-24 Banana 22.1 Yellow 1 
2013-11-24 Orange  8.6 Orange 2
2013-11-24 Apple   7.6 Green  3 
2013-11-24 Celery 10.2 Green  4
2013-11-25 Apple  22.1 Red    5
2013-11-25 Apple  22.1 Red    5
2013-11-25 Orange  8.6 Orange 6

then the result would be

         Date   Fruit   Num   Color
0  2013-11-25   Apple  22.1     Red
1  2013-11-25   Apple  22.1     Red
2  2013-11-25  Orange   8.6  Orange
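If you want to check this yourself, here is a self-contained sketch that rebuilds the sample frames (with the duplicated Apple row in df2) and runs the two steps separately. The names cols, unique_rows and both are only illustrative and not part of the original snippet.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
# df2 with the 2013-11-25 Apple row repeated.
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 3,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Red", "Orange"],
    "A": [1, 2, 3, 4, 5, 5, 6],
})

cols = list(df1.columns)

# Step 1: de-duplicate each frame, concatenate, and drop every row that
# still occurs more than once (those rows are present in both frames).
unique_rows = pd.concat(
    [df1.drop_duplicates(), df2[cols].drop_duplicates()]
).drop_duplicates(keep=False)

# Step 2: an inner merge with the full concatenation brings back each
# remaining row as often as it occurs in the originals.
both = pd.concat([df1, df2[cols]])
res = unique_rows.merge(both, on=cols, how="inner")
print(res)
```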

If you don't want to preserve the duplicates in the result, then just do

res = pd.concat(
    [df1.drop_duplicates(), df2[df1.columns].drop_duplicates()]
).drop_duplicates(keep=False)

CodePudding user response：

First, get the column names of df1:

df1_columns = df1.columns  # Index(["Date", "Fruit", "Num", "Color"])

Then create a new dataframe from df2 that contains only those columns:

df2_filtered = df2[df1_columns]

Now you can apply the solution from that other question:

# concatenate both dataframes
df = pd.concat([df1, df2_filtered])
df = df.reset_index(drop=True)

# group by all columns, so identical rows end up in the same group
df_gpby = df.groupby(list(df.columns))

# get the index of the rows that occur only once
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]

# keep only those rows
df.reindex(idx)
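Put together on the sample data from the question, the steps above could look like the sketch below. Two details are my own additions rather than part of the original snippet: the result is stored in a variable res, and idx is sorted so the row order does not depend on the order of the groups.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4,
    "Fruit": ["Banana", "Orange", "Apple", "Celery"],
    "Num": [22.1, 8.6, 7.6, 10.2],
    "Color": ["Yellow", "Orange", "Green", "Green"],
})
df2 = pd.DataFrame({
    "Date": ["2013-11-24"] * 4 + ["2013-11-25"] * 2,
    "Fruit": ["Banana", "Orange", "Apple", "Celery", "Apple", "Orange"],
    "Num": [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    "Color": ["Yellow", "Orange", "Green", "Green", "Red", "Orange"],
    "A": [1, 2, 3, 4, 5, 6],
})

# Restrict df2 to df1's columns and stack the two frames.
df2_filtered = df2[df1.columns]
df = pd.concat([df1, df2_filtered]).reset_index(drop=True)

# Group identical rows together; a group of size 1 is a row that
# appears in only one of the two frames.
df_gpby = df.groupby(list(df.columns))
idx = sorted(x[0] for x in df_gpby.groups.values() if len(x) == 1)

res = df.reindex(idx)
print(res)
```

Because of reset_index, the remaining rows carry the labels of the concatenated frame, not their original labels in df2, and column A is not part of the result.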

Hope it helps!