I'm trying to generate the input dataset for multi-task learning, but I faced some problems during concatenating the datasets.
This is the example dataframes for merging.
import pandas as pd
df1 = pd.DataFrame({
'sample': ['A', 'B', 'C', 'D'],
'group': [1,0,1,0],
'value': [123, 64, 534, 873]
})
df2 = pd.DataFrame({
'sample': ['A', 'D', 'E'],
'group': [1,1,0],
'value': [372, 981, 23]
})
df1
# sample group value
#0 A 1 123
#1 B 0 64
#2 C 1 534
#3 D 0 873
df2
# sample group value
#0 A 1 372
#1 D 1 981
#2 E 0 23
Expected result
#group_x and group_y columns do not need to be the same name as the expected result.
df3
# sample group_x group_y value
#0 A 1 NaN 123
#1 B 0 NaN 64
#2 C 1 NaN 534
#3 D 0 NaN 873
#4 A NaN 1 372
#5 D NaN 1 981
#6 E NaN 0 23
I tried pd.concat
and pd.merge
for appending two dataframes because they share the same columns.
But 1) with pd.concat
, I could not append group columns horizontally, and 2) pd.merge
expand columns widely.
Is there any recommended function for this job?
CodePudding user response:
You need a outer merge but on columns ['sample', 'value']
out = df1.merge(df2, how='outer', on=['sample', 'value'])
print(out)
sample group_x value group_y
0 A 1.0 123 NaN
1 B 0.0 64 NaN
2 C 1.0 534 NaN
3 D 0.0 873 NaN
4 A NaN 372 1.0
5 D NaN 981 1.0
6 E NaN 23 0.0