I have two datasets, one has old data and one has updated data. I'd like to create a new dataset by updating values based on if the area, date and column values match.
Data
df1
area date aa bb cc
japan 10/1/2027 1 0 0
us 1/1/2022 5 5 5
fiji 11/2/2026 1 1 1
df2
area date aa bb cc stat
japan 10/1/2027 0 5 5 yes
fiji 11/2/2026 0 0 10 no
I have two datasets. I wish to replace the values in [aa], [bb], and [cc] columns of df2 with the updated values from df1 if we have the same date and area values. The aa, bb, and cc column are replaced with the updated values.
Desired
area date aa bb cc stat
japan 10/1/2027 1 0 0 yes
fiji 11/2/2026 1 1 1 no
Doing
df['date'] = df.date.apply(lambda x: np.nan if x == ' ' else x)
I am not exactly sure how to set this up, however, I have an idea. Any suggestion is appreciated
CodePudding user response:
You can merge
and combine_first
:
cols = ['area', 'date']
out = (df2[cols].merge(df1, on=cols, how='left')
.combine_first(df2)[df2.columns]
)
output:
area date aa bb cc stat
0 japan 10/1/2027 1 0 0 yes
1 fiji 11/2/2026 1 1 1 no
CodePudding user response:
Using .merge and making sure date columns in both dfs are set to datetime.
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])
df3 = pd.merge(left=df1, right=df2, on=["area", "date"], how="right").filter(regex=r".*(?<!_y)$")
df3.columns = df3.columns.str.split("_").str[0]
print(df3)
area date aa bb cc stat
0 japan 2027-10-01 1 0 0 yes
1 fiji 2026-11-02 1 1 1 no
CodePudding user response:
I think this can possibly be simplified to:
output = df1[df1['area'].isin(df2['area']) & df1['date'].isin(df2['date'])]
OUTPUT:
area date aa bb cc stat
japan 10/1/2027 1 0 0 yes
fiji 11/2/2026 1 1 1 no
Even when df1 looks like this:
DF1:
area date aa bb cc
0 japan 10/1/2027 1 0 0
1 us 1/1/2022 5 5 5
2 fiji 11/2/2026 1 1 1
3 fiji 12/5/2025 9 9 9