Merge two DataFrames 1:1

I'd like to merge two DataFrames that share two common columns. They have the same number of rows, and I know the row order is the same in both, so they are already aligned. My problem is that after merging I'm left with more rows than I originally had.

Is there a way to merge these two DataFrames and keep the original number of rows?

import pandas as pd

df1 = pd.DataFrame(
   [
      {"col1": 1, "col2": 1, "unique_df1_val": "value1"},
      {"col1": 2, "col2": 2, "unique_df1_val": "value2"},
      {"col1": 2, "col2": 2, "unique_df1_val": "value3"},
   ]
)

df2 = pd.DataFrame(
   [
      {"col1": 1, "col2": 1, "unique_df2_val": "value4"},
      {"col1": 2, "col2": 2, "unique_df2_val": "value5"},
      {"col1": 2, "col2": 2, "unique_df2_val": "value6"},
   ]
)


### Do some merge of the two ###


# Expected DataFrame

    col1    col2    unique_df1_val  unique_df2_val
0   1       1       value1          value4
1   2       2       value2          value5
2   2       2       value3          value6

I've tried df1.merge(df2, how="outer"), but it returns five rows instead of the expected three, because the duplicated (2, 2) keys are matched in every combination.

df1.merge(df2, how="outer")
# Returns

    col1    col2    unique_df1_val  unique_df2_val
0   1       1       value1          value4
1   2       2       value2          value5
2   2       2       value2          value6
3   2       2       value3          value5
4   2       2       value3          value6

CodePudding user response：

If both DataFrames have the same length, are already sorted, and contain the same number of observations per (col1, col2) pair, consider using join instead of merge. Be cautious, though: by default join operates on the index rather than on column values. From the DataFrame.join documentation:

"Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list."

df1.join(df2, lsuffix='drop').drop(columns=[x + 'drop' for x in df1.columns if x in df2.columns])

Output:

  unique_df1_val  col1  col2 unique_df2_val
0         value1     1     1         value4
1         value2     2     2         value5
2         value3     2     2         value6

If a merge operation is not needed because the data will always be perfectly aligned, you can simply assign the new column:

df1['unique_df2_val'] = df2['unique_df2_val']
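
Both join and plain column assignment align rows by index label, not by position, so the shortcut is only safe when the two indexes match. A minimal sketch of the pitfall, assuming a hypothetical df2 whose index differs from df1's (for example after filtering); the names misaligned and by_position are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 2], "col2": [1, 2, 2],
                    "unique_df1_val": ["value1", "value2", "value3"]})
# Same rows as in the question, but with a shifted index (assumed for illustration).
df2 = pd.DataFrame({"col1": [1, 2, 2], "col2": [1, 2, 2],
                    "unique_df2_val": ["value4", "value5", "value6"]},
                   index=[10, 11, 12])

# No index labels overlap, so the assignment finds nothing to align and fills NaN.
misaligned = df1.copy()
misaligned["unique_df2_val"] = df2["unique_df2_val"]

# Assigning the raw values ignores the index and pairs rows by position.
by_position = df1.copy()
by_position["unique_df2_val"] = df2["unique_df2_val"].to_numpy()
```

Calling reset_index(drop=True) on both frames before the join or assignment would be another way to get positional behaviour.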

CodePudding user response：

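For an efficient solution that is still a real merge, deduplicate the keys before merging by adding a per-group occurrence counter (groupby(cols).cumcount()) to the merge keys: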

cols = ['col1', 'col2']

out = (df1.merge(df2, left_on=[df1.groupby(cols).cumcount(), *cols],
                     right_on=[df2.groupby(cols).cumcount(), *cols])
          .drop(columns='key_0')
      )

Output:

   col1  col2 unique_df1_val unique_df2_val
0     1     1         value1         value4
1     2     2         value2         value5
2     2     2         value3         value6
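
To see why this works, it can help to look at the helper key on its own: groupby(cols).cumcount() numbers the repeated occurrences of each (col1, col2) pair 0, 1, 2, and so on, so the n-th duplicate in df1 can only match the n-th duplicate in df2. A sketch using the question's data; the column name occurrence is a hypothetical name chosen here, and validate="one_to_one" is added as an optional safety check rather than being part of the answer above:

```python
import pandas as pd

cols = ["col1", "col2"]
df1 = pd.DataFrame({"col1": [1, 2, 2], "col2": [1, 2, 2],
                    "unique_df1_val": ["value1", "value2", "value3"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col2": [1, 2, 2],
                    "unique_df2_val": ["value4", "value5", "value6"]})

# Number each repeated key pair within its group: 0 for the first, 1 for the second, ...
left = df1.assign(occurrence=df1.groupby(cols).cumcount())
right = df2.assign(occurrence=df2.groupby(cols).cumcount())

# (col1, col2, occurrence) is now unique per row, so the merge is truly 1:1.
# validate raises pandas.errors.MergeError if that assumption is ever violated.
out = (left.merge(right, on=[*cols, "occurrence"], validate="one_to_one")
           .drop(columns="occurrence"))
```

Storing the counter as a named column avoids the auto-generated key_0 column, at the cost of one extra drop.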

CodePudding user response：

Use pd.concat instead of merge. If your data is already aligned, you can simply concatenate the column you want onto the end of the DataFrame.

pd.concat([df1, df2['unique_df2_val']], axis=1)

Output:

    col1    col2    unique_df1_val  unique_df2_val
0   1       1       value1          value4
1   2       2       value2          value5
2   2       2       value3          value6
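
pd.concat with axis=1 also aligns on index labels and never looks at col1 or col2, so a cheap check that the key columns really agree row by row can catch silent misalignment. A sketch under that assumption; concat_aligned is a hypothetical helper name, not part of pandas:

```python
import pandas as pd

def concat_aligned(left, right, keys):
    """Append the non-key columns of `right` to `left`, pairing rows by position."""
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)
    # Refuse to continue if the shared key columns differ in any row
    # (DataFrame.equals also returns False when the lengths or dtypes differ).
    if not left[keys].equals(right[keys]):
        raise ValueError("key columns are not aligned row by row")
    return pd.concat([left, right.drop(columns=keys)], axis=1)

df1 = pd.DataFrame({"col1": [1, 2, 2], "col2": [1, 2, 2],
                    "unique_df1_val": ["value1", "value2", "value3"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col2": [1, 2, 2],
                    "unique_df2_val": ["value4", "value5", "value6"]})

out = concat_aligned(df1, df2, ["col1", "col2"])
```

If the check fails, falling back to a key-based merge is probably safer than concatenating.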