Lets say we have two dataframe called df
which is the data frame that we want modification and a
reference data
frame as the data frame should look like;
but df
is sometimes partially missing some rows from the reference
dataframe
import pandas as pd
df = pd.DataFrame({'gr1f': ['A','A','B','A','B','A','B'],
'gr2p': ['CC','CC','CC','CC','CC','CC','CC'],
'gr3a': ['AL','AL','AL','DEL','DEL','DEL','DEL'],
'DP': [ 'l1yrs', 'l10yrs','l10yrs','l1yrs','l1yrs','l10yrs','l10yrs'],
'num_vals1' :[
1.138044999,
1.221786568,
1.7,
1.159030763,
1.2,
1.5,
1.238341765 ],
'num_vals2' :[
0.166637328,
0.04513741,
0.12,
0.141340498,
0.87,
0.033441602,
0.01233441602]})
df
reference_df = pd.DataFrame({
'gr2p': ['CC','CC','CC','CC','DD','DD','DD','DD'],
'gr3a': ['AL','AL','AL','AL','DEL','DEL','DEL','DEL'],
'DP': ['l1yrs', 'l1yrs', 'l10yrs','l10yrs','l1yrs','l1yrs','l10yrs','l10yrs']})
reference_df
to merge this two data frame I used right join
The expected output that I need to have
gr1f gr2p gr3a DP num_vals1 num_vals2 source
0 A CC AL l10yrs 1.221787 0.045137 df
1 B CC AL l10yrs 1.700000 0.120000 df
2 A CC AL l1yrs 1.138045 0.166637 df
3 B CC AL l1yrs NaN NaN reference_df
4 A CC DEL l10yrs 1.50000 0.033441602 df
5 B CC DEL l10yrs 1.238342 0.012334 df
6 A CC DEL l1yrs 1.15903 0.1413404 df
7 B CC DEL l1yrs 1.20000 0.87 df
8 A DD DEL l10yrs NaN NaN reference_df
9 B DD DEL l10yrs NaN NaN reference_df
10 A DD DEL l1yrs NaN NaN reference_df
11 B DD DEL l1yrs NaN NaN reference_df
how can we achieve this final table with filling missing rows gr1f
with A
and B
combinations and DP
with 'l10yrs, l1yrs' combination. I think the expected output is showing the pattern so its a good guide.
CodePudding user response:
There's indicator
option you can use. Also, you're not looking for right merge, you are looking for outer merge:
df1 = (df.merge(reference_df[['gr2p','gr3a','DP']]
.assign(gr1f=['A','B']*(len(reference_df)//2)),
on = ['gr1f','gr2p','gr3a','DP'],
how='outer', indicator=True)
.sort_values(['gr2p','gr3a'])
.assign(source=lambda x: x._merge.map({'both':'df', 'right_only':'ref', 'left_only':'df'}))
.reset_index(drop=True)
)
Output:
gr1f gr2p gr3a DP num_vals1 num_vals2 _merge source
0 A CC AL l1yrs 1.138045 0.166637 both df
1 A CC AL l10yrs 1.221787 0.045137 both df
2 B CC AL l10yrs 1.700000 0.120000 both df
3 B CC AL l1yrs NaN NaN right_only ref
4 A CC DEL l1yrs 1.159031 0.141340 left_only df
5 B CC DEL l1yrs 1.200000 0.870000 left_only df
6 A CC DEL l10yrs 1.500000 0.033442 left_only df
7 B CC DEL l10yrs 1.238342 0.012334 left_only df
8 A DD DEL l1yrs NaN NaN right_only ref
9 B DD DEL l1yrs NaN NaN right_only ref
10 A DD DEL l10yrs NaN NaN right_only ref
11 B DD DEL l10yrs NaN NaN right_only ref