I'd like to create a new dataset (df3) by updating values based on the [area] and [Date] column match.
Here are the two datasets. df1 contains the update, df2 contains the file that will be updated with df1 data. If there is no date or area match, the bb, aa and cc values will be 0.
DATA
df1
Date area bb aa cc
7/1/2023 Australia 30 0 0
8/1/2023 Australia 50 0 0
4/1/2024 Australia 0 0 12
6/1/2024 Australia 30 0 24
9/1/2024 Australia 0 0 24
11/1/2024 Australia 0 0 24
2/1/2025 Australia 35 0 0
df2
Date area bb aa cc stat
03/1/2023 Australia 5 5 5 yes
06/1/2023 Australia 0 0 0 no
07/1/2023 Australia 0 0 0 yes
08/1/2023 Australia 0 0 0 yes
09/1/2023 Australia 0 0 0 no
10/1/2023 Australia 0 0 0 no
02/1/2024 Australia 0 0 0 no
03/1/2024 Australia 0 0 0 yes
04/1/2024 Australia 0 0 0 yes
05/1/2024 Australia 0 0 0 no
06/1/2024 Australia 0 0 0 no
9/1/2024 Australia 10 10 10 yes
11/1/2024 Australia 10 11 20 yes
2/1/2025 Australia 35 0 0 yes
1/1/2026 Malaysia 7 9 8 no
DESIRED
We see that the new dataset values are replaced with df1 values.(look at dates: 7/1/2023, 8/1/2023 etc.) If df1 does not contain df2 values, aa, bb, cc are set to 0. (look at 3/1/2023 and 1/1/2026 values)
df3
Date area bb aa cc stat
03/1/2023 Australia 0 0 0 yes
06/1/2023 Australia 0 0 0 no
07/1/2023 Australia 30 0 0 yes
08/1/2023 Australia 50 0 0 yes
09/1/2023 Australia 0 0 0 no
10/1/2023 Australia 0 0 0 no
02/1/2024 Australia 0 0 0 no
03/1/2024 Australia 0 0 0 yes
04/1/2024 Australia 0 0 12 yes
05/1/2024 Australia 0 0 0 no
06/1/2024 Australia 30 0 24 no
9/1/2024 Australia 0 0 24 yes
11/1/2024 Australia 0 0 24 yes
2/1/2025 Australia 35 0 0 yes
1/1/2026 Malaysia 0 0 0 no
Doing
I think it is best to use a join or merge- however, some of the data is not being replaced
df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])
df3 = pd.merge(left=df1, right=df2, on=["area", "date"], how="right").filter(regex=r".*(?<!_y)$")
df3.columns = df3.columns.str.split("_").str[0]
Any suggestion is appreciated. I am still researching.
CodePudding user response:
You can use:
# ensure same data
# you could also keep the strings and pad the zeros
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
keys = ['Date', 'area']
# align df1 on df2's keys and fill with 0s
df3 = (df2[keys ['stat']]
.merge(df1, on=keys, how='left')
.fillna(0, downcast='infer')
[df2.columns] # restore original order
)
output:
Date area bb aa cc stat
0 2023-03-01 Australia 0 0 0 yes
1 2023-06-01 Australia 0 0 0 no
2 2023-07-01 Australia 30 0 0 yes
3 2023-08-01 Australia 50 0 0 yes
4 2023-09-01 Australia 0 0 0 no
5 2023-10-01 Australia 0 0 0 no
6 2024-02-01 Australia 0 0 0 no
7 2024-03-01 Australia 0 0 0 yes
8 2024-04-01 Australia 0 0 12 yes
9 2024-05-01 Australia 0 0 0 no
10 2024-06-01 Australia 30 0 24 no
11 2024-09-01 Australia 0 0 24 yes
12 2024-11-01 Australia 0 0 24 yes
13 2025-02-01 Australia 35 0 0 yes
14 2026-01-01 Malaysia 0 0 0 no