Want an Upsert on the index
df1
is long time-series (historical)df2
is an update which will have new entries, but may overlapdf1
. On overlap,df2
takes precedence.df3
is the output that I want.
I tried calling an outer merge, but my indices are being ignored.
I also tried setting left_index
and right_index
, but that didn't work, either.
import pandas as pd
idx1 = pd.date_range('2015-10-05 10:00:00', '2015-10-05 10:04:00', freq='1min')
df1 = pd.DataFrame({'x': [1,2,3,4,5]}, index=idx1)
idx2 = pd.date_range('2015-10-05 10:03:00', '2015-10-05 10:07:00', freq='1min')
df2 = pd.DataFrame({'x': [10,20,30,40,50]}, index=idx2)
idx3 = pd.date_range('2015-10-05 10:00:00', '2015-10-05 10:07:00', freq='1min')
df3 = pd.DataFrame({'x': [1,2,3,10,20,30,40,50]}, index=idx3)
# pd.merge(df1, df2, how='outer') # index being ignored
# df = pd.merge(df_hist, df_update, left_index=True, right_index=True, how='outer') # also bad
CodePudding user response:
You can append
then groupby
: Notice the order of append do matter in your case, since you would like to keep df2 value
out = df2.append(df1).groupby(level=0).head(1).sort_index()
Out[231]:
x
2015-10-05 10:00:00 1
2015-10-05 10:01:00 2
2015-10-05 10:02:00 3
2015-10-05 10:03:00 10
2015-10-05 10:04:00 20
2015-10-05 10:05:00 30
2015-10-05 10:06:00 40
2015-10-05 10:07:00 50
CodePudding user response:
Just pd.concat
your dataframes and then use x.loc
a lambda function with duplicated(keep='last')
:
df3 = pd.concat([df1, df2]).loc[lambda x: ~x.index.duplicated(keep='last')]
Output:
>>> df3
x
2015-10-05 10:00:00 1
2015-10-05 10:01:00 2
2015-10-05 10:02:00 3
2015-10-05 10:03:00 10
2015-10-05 10:04:00 20
2015-10-05 10:05:00 30
2015-10-05 10:06:00 40
2015-10-05 10:07:00 50
CodePudding user response:
We can use combine_first
here:
df3 = df2.combine_first(df1)
df3
Output:
x
2015-10-05 10:00:00 1
2015-10-05 10:01:00 2
2015-10-05 10:02:00 3
2015-10-05 10:03:00 10
2015-10-05 10:04:00 20
2015-10-05 10:05:00 30
2015-10-05 10:06:00 40
2015-10-05 10:07:00 50