Home > OS >  Join 2 dataframes based on their differences while keeping common rows and columns python
Join 2 dataframes based on their differences while keeping common rows and columns python

Time:07-12

Context: I'm trying to add data from a new dataframe into an old dataframe where the only differences between them is that the data has been replaced with other information. I'll use an example to explain myself better:

Let's say the old dataframe has this structure:

test_old = pd.DataFrame.from_dict({'FactsEN' :['sales','price','promotion','sales','price','promotion'], 'Sales' : [12345,12,11,54321,14,12], 'Type' : ['type1','type1','type1','type2','type2','type2']})


test_new = pd.DataFrame.from_dict({'FactsEN' :['sales','price','new_promotion','sales','price','new_promotion'], 'Sales' : [12345,12,11,54321,14,13], 'Type' : ['type1','type1','type1','type2','type2','type2']})

How can I join both dataframes keeping the common columns and simply adding the differences between them? Ideally the dataframes would go from this:

>>> test_old
     FactsEN  Sales   Type
0      sales  12345  type1
1      price     12  type1
2  promotion     11  type1
3      sales  54321  type2
4      price     14  type2
5  promotion     12  type2

>>> test_new
         FactsEN  Sales   Type
0          sales  12345  type1
1          price     12  type1
2  new_promotion     11  type1
3          sales  54321  type2
4          price     14  type2
5  new_promotion     13  type2

To this:

         FactsEN  Sales   Type
0          sales  12345  type1
1          price     12  type1
2      promotion     11  type1
3  new_promotion     11  type1
4          sales  54321  type2
5          price     14  type2
6      promotion     12  type2
7  new_promotion     13  type2

I was trying concatenate and merges, but none of them managed to get the output...

concatenated_dataframe = pd.merge(old,new, how='outer', left_on=new.columns, right_on=old.columns)

Any help is very welcome!

CodePudding user response:

We can use pandas.DataFrame.merge and join on multiple columns.

pd.merge(test_old, test_new, on=['FactsEN',  'Sales',   'Type'], how="outer")

This gives us the expected data

         FactsEN  Sales   Type
0          sales  12345  type1
1          price     12  type1
2      promotion     11  type1
3          sales  54321  type2
4          price     14  type2
5      promotion     12  type2
6  new_promotion     11  type1
7  new_promotion     13  type2

CodePudding user response:

You can use pd.concat and then drop duplicates:

pd.concat([test_old, test_new], ignore_index=True).drop_duplicates().reset_index(drop=True)
  • Related