How to merge 2 dataframes by aligning the values in 2 columns?-CodePudding

I have the following dataframe df to process:

    Name     C1     Value_1      C2     Value_2
0      A    112        2.36     112        3.77
1      A    211        1.13     122        2.53
2      A    242        1.22     211        1.13
3      A    245        3.87     242        1.38
4      A    311        3.13     243        4.00
5      A    312        7.11     311        2.07
6      A    NaN         NaN     312        7.11
7      A    NaN         NaN     324        1.06

As you can see, the 2 columns of "codes", C1 and C2, are not aligned on the same levels: codes 122, 243, 324 (in column C2) do not appear in column C1, and code 245 (in column C1) does not appear in column C2.

I would like to reconstruct a file where the codes are aligned according to their value, so as to obtain this:

    Name     C1     Value_1      C2     Value_2
0      A    112        2.36     112        3.77
1      A    122         NaN     122        2.53
2      A    211        1.13     211        1.13
3      A    242        1.22     242        1.38
4      A    243         NaN     243        4.00
5      A    245        3.87     245         NaN
6      A    311        3.13     311        2.07
7      A    312        7.11     312        7.11
8      A    324         NaN     324        1.06

In order to do so, I thought of creating 2 subsets:

left = df[['Name', 'C1', 'Value_1']]
right = df[['Name', 'C2', 'Value_2']]

and I tried to merge them, manipulating the function merge:

left.merge(right, on=..., how=..., suffixes=...)

but I got lost in the parameters that should be used to achieve the result. What do you think would be the best way to do it?

Appendix: In order to create the initial dataframe, one could use:

names = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']
code1 = [112,211,242,245,311,312,np.nan,np.nan]
zone1 = [2.36, 1.13, 1.22, 3.87, 3.13, 7.11, np.nan, np.nan]
code2 = [112,122,211,242,243,311,312,324]
zone2 = [3.77, 2.53, 1.13, 1.38, 4.00, 2.07, 7.11, 1.06]
df = pd.DataFrame({'Name': names, 'C1': code1, 'Value_1': zone1, 'C2': code2, 'Value_2': zone2})

CodePudding user response：

You are almost there:

left.merge(right, right_on = "C2", left_on = "C1", how="right").fillna(0)

Output

Name_x	C1	Value_1	Name_y	C2	Value_2
A	112	2.36	A	112	3.77
0	0	0	A	122	2.53
A	211	1.13	A	211	1.13
A	242	1.22	A	242	1.38
0	0	0	A	243	4
A	311	3.13	A	311	2.07
A	312	7.11	A	312	7.11
0	0	0	A	324	1.06

CodePudding user response：

IIUC, you can perform an outer merge, then dropna on the missing values:

(df[['Name', 'C1', 'Value_1']]
 .merge(df[['Name', 'C2', 'Value_2']],
        left_on=['Name', 'C1'], right_on=['Name', 'C2'], how='outer')
 .dropna(subset=['C1', 'C2'], how='all')
)

output:

   Name     C1  Value_1     C2  Value_2
0     A  112.0     2.36  112.0     3.77
1     A  211.0     1.13  211.0     1.13
2     A  242.0     1.22  242.0     1.38
3     A  245.0     3.87    NaN      NaN
4     A  311.0     3.13  311.0     2.07
5     A  312.0     7.11  312.0     7.11
8     A    NaN      NaN  122.0     2.53
9     A    NaN      NaN  243.0     4.00
10    A    NaN      NaN  324.0     1.06