I have dataframe df1:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 5
3 txn vol(new) 2020-02-01 20
4 txn vol(tenu) 2020-01-01 30
5 txn vol(tenu) 2020-02-01 40
Second Dataframe df2:
Expenses Calendar Actual
0 txn vol(new) 2020-01-01 23
1 txn vol(new) 2020-02-01 32
2 txn vol(tenu) 2020-01-01 60
Now I wanted to read all data from df1, and join on df2 with Expenses Calendar, then replace actual value in df1 from df2.
Expected output is:
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
I am using below code
cols_to_replace = ['Actual']
df1.loc[df1.set_index(['Calendar','Expenses']).index.isin(df2.set_index(['Calendar','Expenses']).index), cols_to_replace] = df2.loc[df2.set_index(['Calendar','Expenses']).index.isin(df1.set_index(['Calendar','Expenses']).index),cols_to_replace].values
It is working when I have small data in df1. When it has (10K records), updates are happening with wrong values. df1 has 10K records, and df2 has 150 records. Could anyone please suggest how to resolve this?
Thank you
CodePudding user response:
here is one way to do it, using pd.merge
df=df.merge(df2,
on=['Expenses', 'Calendar'],
how='left',
suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
df['Actual']=df['Actual'].astype(int)
df
Expenses Calendar Actual
0 xyz 2020-01-01 10
1 xyz 2020-02-01 99
2 txn vol(new) 2020-01-01 23
3 txn vol(new) 2020-02-01 32
4 txn vol(tenu) 2020-01-01 60
5 txn vol(tenu) 2020-02-01 40
CodePudding user response:
If I understand your solution correctly, it seems to assume that (1) the Calendar
-Expenses
combinations are unique and (2) that their occurrences in both dataframes are aligned (same order)? I suspect that (2) isn't actually the case?
Another option - .merge()
is fine! - could be:
df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
df1.loc[df2.index, "Actual"] = df2["Actual"]
df2 = df2.reset_index() # If the original df2 is still needed
df1 = df1.reset_index()