Tricky update values from one dataset with another using Pandas (with conditions)-CodePudding

I'd like to create a new dataset (df3) by updating values based on the [area] and [Date] column match.

Here are the two datasets. df1 contains the update, df2 contains the file that will be updated with df1 data. If there is no date or area match, the bb, aa and cc values will be 0.

DATA

df1

Date        area        bb  aa  cc  
7/1/2023    Australia   30  0   0   
8/1/2023    Australia   50  0   0   
4/1/2024    Australia   0   0   12  
6/1/2024    Australia   30  0   24  
9/1/2024    Australia   0   0   24  
11/1/2024   Australia   0   0   24  
2/1/2025    Australia   35  0   0

df2

Date        area        bb  aa  cc  stat
03/1/2023   Australia   5   5   5   yes
06/1/2023   Australia   0   0   0   no
07/1/2023   Australia   0   0   0   yes
08/1/2023   Australia   0   0   0   yes
09/1/2023   Australia   0   0   0   no
10/1/2023   Australia   0   0   0   no
02/1/2024   Australia   0   0   0   no
03/1/2024   Australia   0   0   0   yes
04/1/2024   Australia   0   0   0   yes
05/1/2024   Australia   0   0   0   no
06/1/2024   Australia   0   0   0   no
9/1/2024    Australia   10  10  10  yes
11/1/2024   Australia   10  11  20  yes
2/1/2025    Australia   35  0   0   yes
1/1/2026    Malaysia    7   9   8   no

DESIRED

We see that the new dataset values are replaced with df1 values.(look at dates: 7/1/2023, 8/1/2023 etc.) If df1 does not contain df2 values, aa, bb, cc are set to 0. (look at 3/1/2023 and 1/1/2026 values)

df3

Date        area        bb  aa  cc  stat
03/1/2023   Australia   0   0   0   yes
06/1/2023   Australia   0   0   0   no
07/1/2023   Australia   30  0   0   yes
08/1/2023   Australia   50  0   0   yes
09/1/2023   Australia   0   0   0   no
10/1/2023   Australia   0   0   0   no
02/1/2024   Australia   0   0   0   no
03/1/2024   Australia   0   0   0   yes
04/1/2024   Australia   0   0   12  yes
05/1/2024   Australia   0   0   0   no
06/1/2024   Australia   30  0   24  no
9/1/2024    Australia   0   0   24  yes
11/1/2024   Australia   0   0   24  yes
2/1/2025    Australia   35  0   0   yes
1/1/2026    Malaysia    0   0   0   no

Doing

I think it is best to use a join or merge- however, some of the data is not being replaced

df1["date"] = pd.to_datetime(df1["date"])
df2["date"] = pd.to_datetime(df2["date"])

df3 = pd.merge(left=df1, right=df2, on=["area", "date"], how="right").filter(regex=r".*(?<!_y)$")
df3.columns = df3.columns.str.split("_").str[0]

Any suggestion is appreciated. I am still researching.

CodePudding user response：

You can use:

# ensure same data
# you could also keep the strings and pad the zeros
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

keys = ['Date', 'area']
# align df1 on df2's keys and fill with 0s
df3 = (df2[keys ['stat']]
 .merge(df1, on=keys, how='left')
 .fillna(0, downcast='infer')
 [df2.columns] # restore original order
)

output:

         Date       area  bb  aa  cc stat
0  2023-03-01  Australia   0   0   0  yes
1  2023-06-01  Australia   0   0   0   no
2  2023-07-01  Australia  30   0   0  yes
3  2023-08-01  Australia  50   0   0  yes
4  2023-09-01  Australia   0   0   0   no
5  2023-10-01  Australia   0   0   0   no
6  2024-02-01  Australia   0   0   0   no
7  2024-03-01  Australia   0   0   0  yes
8  2024-04-01  Australia   0   0  12  yes
9  2024-05-01  Australia   0   0   0   no
10 2024-06-01  Australia  30   0  24   no
11 2024-09-01  Australia   0   0  24  yes
12 2024-11-01  Australia   0   0  24  yes
13 2025-02-01  Australia  35   0   0  yes
14 2026-01-01   Malaysia   0   0   0   no