Context: I'm trying to add data from a new dataframe into an old dataframe where the only differences between them is that the data has been replaced with other information. I'll use an example to explain myself better:
Let's say the old dataframe has this structure:
test_old = pd.DataFrame.from_dict({'FactsEN' :['sales','price','promotion','sales','price','promotion'], 'Sales' : [12345,12,11,54321,14,12], 'Type' : ['type1','type1','type1','type2','type2','type2']})
test_new = pd.DataFrame.from_dict({'FactsEN' :['sales','price','new_promotion','sales','price','new_promotion'], 'Sales' : [12345,12,11,54321,14,13], 'Type' : ['type1','type1','type1','type2','type2','type2']})
How can I join both dataframes keeping the common columns and simply adding the differences between them? Ideally the dataframes would go from this:
>>> test_old
FactsEN Sales Type
0 sales 12345 type1
1 price 12 type1
2 promotion 11 type1
3 sales 54321 type2
4 price 14 type2
5 promotion 12 type2
>>> test_new
FactsEN Sales Type
0 sales 12345 type1
1 price 12 type1
2 new_promotion 11 type1
3 sales 54321 type2
4 price 14 type2
5 new_promotion 13 type2
To this:
FactsEN Sales Type
0 sales 12345 type1
1 price 12 type1
2 promotion 11 type1
3 new_promotion 11 type1
4 sales 54321 type2
5 price 14 type2
6 promotion 12 type2
7 new_promotion 13 type2
I was trying concatenate and merges, but none of them managed to get the output...
concatenated_dataframe = pd.merge(old,new, how='outer', left_on=new.columns, right_on=old.columns)
Any help is very welcome!
CodePudding user response:
We can use pandas.DataFrame.merge
and join on multiple columns.
pd.merge(test_old, test_new, on=['FactsEN', 'Sales', 'Type'], how="outer")
This gives us the expected data
FactsEN Sales Type
0 sales 12345 type1
1 price 12 type1
2 promotion 11 type1
3 sales 54321 type2
4 price 14 type2
5 promotion 12 type2
6 new_promotion 11 type1
7 new_promotion 13 type2
CodePudding user response:
You can use pd.concat and then drop duplicates:
pd.concat([test_old, test_new], ignore_index=True).drop_duplicates().reset_index(drop=True)