I am dealing with a problem like this:
mydata_old
| x | y |
|-----|------|
| 1 | 10 |
| 2 | 15 |
| 3 | 12 |
| 4 | 21 |
mydata_new
| x | y |
|-----|------|
| 4 | 20 |
| 5 | 25 |
| 6 | 30 |
I would like to combine both tables to get:
mydata
| x | y |
|-----|------|
| 1 | 10 |
| 2 | 15 |
| 3 | 12 |
| 4 | 20 | # x = 4 does not show up twice
| 5 | 25 |
| 6 | 30 |
I tried with pandas.merge
, but I cannot get the desired result.
mydata = pd.merge(mydata_new, mydata_old, how="left")
CodePudding user response:
One way to do this is with an outer join (i.e., keep the index from both dataframes). You can achieve this using pd.merge(mydata_new, mydata_old, how='outer')
and then keep the newest data available.
However, this could be more straight forward with pandas.concat
.
Here, I am concatenating both dataframes. Note that I exclude rows from mydata_old
if x
appears in mydata_new
. This way you can keep the newest data when it's available in both dataframes (old and new).
# Concat frames and if data is available in both, keep data from mydata_new
mydata = pd.concat(
[
mydata_old.loc[~mydata_old['x'].isin(mydata_new['x'])],
mydata_new
],
axis=0)
This will print the following:
x | y |
---|---|
1 | 10 |
2 | 15 |
3 | 12 |
4 | 20 |
5 | 25 |
6 | 30 |