JOIN two DataFrames and replace Column values in Python-CodePudding

I have dataframe df1:

    Expenses        Calendar    Actual
0   xyz             2020-01-01  10
1   xyz             2020-02-01  99
2   txn vol(new)    2020-01-01  5
3   txn vol(new)    2020-02-01  20
4   txn vol(tenu)   2020-01-01  30
5   txn vol(tenu)   2020-02-01  40

Second Dataframe df2:

    Expenses    Calendar    Actual
0   txn vol(new)    2020-01-01  23
1   txn vol(new)    2020-02-01  32
2   txn vol(tenu)   2020-01-01  60

Now I wanted to read all data from df1, and join on df2 with Expenses Calendar, then replace actual value in df1 from df2.

Expected output is:

    Expenses        Calendar    Actual
0   xyz             2020-01-01  10
1   xyz             2020-02-01  99
2   txn vol(new)    2020-01-01  23
3   txn vol(new)    2020-02-01  32
4   txn vol(tenu)   2020-01-01  60
5   txn vol(tenu)   2020-02-01  40

I am using below code

cols_to_replace = ['Actual']
df1.loc[df1.set_index(['Calendar','Expenses']).index.isin(df2.set_index(['Calendar','Expenses']).index), cols_to_replace] = df2.loc[df2.set_index(['Calendar','Expenses']).index.isin(df1.set_index(['Calendar','Expenses']).index),cols_to_replace].values

It is working when I have small data in df1. When it has (10K records), updates are happening with wrong values. df1 has 10K records, and df2 has 150 records. Could anyone please suggest how to resolve this?

Thank you

CodePudding user response：

here is one way to do it, using pd.merge

df=df.merge(df2,
        on=['Expenses', 'Calendar'],
        how='left',
        suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
df['Actual']=df['Actual'].astype(int)
df

Expenses              Calendar  Actual
0   xyz             2020-01-01      10
1   xyz             2020-02-01      99
2   txn vol(new)    2020-01-01      23
3   txn vol(new)    2020-02-01      32
4   txn vol(tenu)   2020-01-01      60
5   txn vol(tenu)   2020-02-01      40

CodePudding user response：

If I understand your solution correctly, it seems to assume that (1) the Calendar-Expenses combinations are unique and (2) that their occurrences in both dataframes are aligned (same order)? I suspect that (2) isn't actually the case?

Another option - .merge() is fine! - could be:

df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
df1.loc[df2.index, "Actual"] = df2["Actual"]
df2 = df2.reset_index()  # If the original df2 is still needed
df1 = df1.reset_index()