I'd like to merge two DataFrames that share two common columns. They have the same number of rows, and I know the row order is the same in both, so they are already aligned. My problem is that after merging I'm left with more rows than I originally had.
Is there a way to merge these two DataFrames and keep the original number of rows?
import pandas as pd

df1 = pd.DataFrame(
    [
        {"col1": 1, "col2": 1, "unique_df1_val": "value1"},
        {"col1": 2, "col2": 2, "unique_df1_val": "value2"},
        {"col1": 2, "col2": 2, "unique_df1_val": "value3"},
    ]
)
df2 = pd.DataFrame(
    [
        {"col1": 1, "col2": 1, "unique_df2_val": "value4"},
        {"col1": 2, "col2": 2, "unique_df2_val": "value5"},
        {"col1": 2, "col2": 2, "unique_df2_val": "value6"},
    ]
)
### Do some merge of the two ###
# Expected DataFrame
   col1  col2 unique_df1_val unique_df2_val
0     1     1         value1         value4
1     2     2         value2         value5
2     2     2         value3         value6
I've tried df1.merge(df2, how="outer"), but this doesn't give me the expected output: every row with col1 == 2 and col2 == 2 in df1 is paired with every such row in df2.
df1.merge(df2, how="outer")
# Returns
   col1  col2 unique_df1_val unique_df2_val
0     1     1         value1         value4
1     2     2         value2         value5
2     2     2         value2         value6
3     2     2         value3         value5
4     2     2         value3         value6
CodePudding user response:
If both DataFrames have the same length, are sorted the same way, and contain the same number of rows for each col1/col2 combination, consider using join instead of merge. Be careful, though: by default join matches rows on the index rather than on column values. From the documentation:
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
# Suffix df1's overlapping columns with 'drop', then remove those suffixed copies
df1.join(df2, lsuffix='drop').drop(columns=[x + 'drop' for x in df1.columns if x in df2.columns])
Output:
  unique_df1_val  col1  col2 unique_df2_val
0         value1     1     1         value4
1         value2     2     2         value5
2         value3     2     2         value6
If a merge operation is not needed because the data will always be perfectly aligned, then just consider defining the new column:
df1['unique_df2_val'] = df2['unique_df2_val']
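One caveat with the direct assignment: pandas aligns the assigned Series on index labels, not on row position. The following is a minimal sketch, assuming df2 happens to carry a different index (the labels 10, 11, 12 are made up for illustration), of how the label-based assignment can go wrong and how assigning the underlying values by position avoids it:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"col1": [1, 2, 2], "col2": [1, 2, 2],
     "unique_df1_val": ["value1", "value2", "value3"]}
)
# Hypothetical situation: same row order as df1, but different index labels
df2 = pd.DataFrame(
    {"col1": [1, 2, 2], "col2": [1, 2, 2],
     "unique_df2_val": ["value4", "value5", "value6"]},
    index=[10, 11, 12],
)

# Label-based assignment: no index labels match, so the new column is all NaN
by_label = df1.copy()
by_label["unique_df2_val"] = df2["unique_df2_val"]

# Position-based assignment: pass the raw values so the index is ignored
by_position = df1.copy()
by_position["unique_df2_val"] = df2["unique_df2_val"].to_numpy()
```

If both frames already share the same default index, as in the question, the plain assignment behaves the same as the positional one.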
CodePudding user response:
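For an efficient solution that still performs a real merge, de-duplicate the keys before merging by using each row's occurrence number within its col1/col2 group (groupby.cumcount) as an extra key: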
cols = ['col1', 'col2']
# cumcount numbers the rows inside each (col1, col2) group, computed per frame
out = (df1.merge(df2, left_on=[df1.groupby(cols).cumcount(), *cols],
                 right_on=[df2.groupby(cols).cumcount(), *cols])
          .drop(columns='key_0')
      )
Output:
   col1  col2 unique_df1_val unique_df2_val
0     1     1         value1         value4
1     2     2         value2         value5
2     2     2         value3         value6
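The same idea can be written with an explicit helper column, which may be easier to read. This is only a sketch: the column name occurrence is a hypothetical choice, and validate="one_to_one" is added as an optional safety check so that pandas raises an error if the extended keys are still not unique.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"col1": [1, 2, 2], "col2": [1, 2, 2],
     "unique_df1_val": ["value1", "value2", "value3"]}
)
df2 = pd.DataFrame(
    {"col1": [1, 2, 2], "col2": [1, 2, 2],
     "unique_df2_val": ["value4", "value5", "value6"]}
)
cols = ["col1", "col2"]

# Number each row within its (col1, col2) group: 0, 0, 1 for the sample data.
# "occurrence" is a hypothetical helper column name.
left = df1.assign(occurrence=df1.groupby(cols).cumcount())
right = df2.assign(occurrence=df2.groupby(cols).cumcount())

# With the occurrence number included, every key is unique on both sides,
# so the merge can no longer multiply rows.
out = (
    left.merge(right, on=[*cols, "occurrence"], how="inner", validate="one_to_one")
        .drop(columns="occurrence")
)
```

Unlike the purely positional approaches, this still pairs rows correctly if the groups appear in a different order in the two frames, as long as the order within each group is the same.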
CodePudding user response:
Use pd.concat instead of merge. If your data is already aligned, you can simply concatenate the column you want onto the end of the DataFrame.
pd.concat([df1, df2['unique_df2_val']], axis=1)
Output:
   col1  col2 unique_df1_val unique_df2_val
0     1     1         value1         value4
1     2     2         value2         value5
2     2     2         value3         value6
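Since concatenation trusts that the rows line up, it may be worth verifying that assumption first. A minimal sketch, assuming the shared key columns are col1 and col2 as in the question, that checks the keys match row by row before concatenating:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"col1": [1, 2, 2], "col2": [1, 2, 2],
     "unique_df1_val": ["value1", "value2", "value3"]}
)
df2 = pd.DataFrame(
    {"col1": [1, 2, 2], "col2": [1, 2, 2],
     "unique_df2_val": ["value4", "value5", "value6"]}
)
cols = ["col1", "col2"]

# Compare the key columns position by position, ignoring index labels
keys_match = len(df1) == len(df2) and bool(
    (df1[cols].to_numpy() == df2[cols].to_numpy()).all()
)
if not keys_match:
    raise ValueError("df1 and df2 are not aligned on col1/col2")

# Drop the duplicated key columns from df2 and glue the rest on by position
out = pd.concat(
    [df1.reset_index(drop=True), df2.drop(columns=cols).reset_index(drop=True)],
    axis=1,
)
```

Dropping cols from df2 generalizes the single-column selection above to the case where df2 has several extra columns.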