Home > OS >  How to merge two data frames having same column names horizontally on basis of similar values in one
How to merge two data frames having same column names horizontally on basis of similar values in one

Time:12-20

I have two data frames as shown below:

A B C D
Red 36 1 type-1
Blue 78 2 type-1
Green 59 3 type-1
A B C D
Orange 78 5 type-2
Purple 59 7 type-2
Brown 36 9 type-2

I want to merge the above two data frames on the basis of column B and after merge I want to keep the same columns as shown below:

A B C D A B C D
Red 36 1 type-1 Brown 36 9 type-2
Blue 78 2 type-1 Orange 78 5 type-2
Green 59 3 type-1 Purple 59 7 type-2

Is it possible to do this using pandas or any other python function?

I have tried using pd.merge function but I needed to change the column names. There exists another function called pd.concat but can I provide the column name (column 'B') in it for merging?

Thanks a lot in advance!

CodePudding user response:

apply rename to jezrael's anwer and you will get desired output

out = (df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
       .rename(columns=lambda x: x.split('_')[0]))

out

    A       B   C   D       A       B   C   D
0   Red     36  1   type-1  Brown   36  9   type-2
1   Blue    78  2   type-1  Orange  78  5   type-2
2   Green   59  3   type-1  Purple  59  7   type-2

CodePudding user response:

You can pass to parameters left_on and right_on columns from both DataFrames, so is created helper column key_0, which is removed after join by DataFrame.merge:

Notice: Pandas has problem with duplicated columns names, it is reason why merge rename them by suffix _x and _y

df = df1.merge(df2, left_on=df1.B, right_on=df2.B).drop('key_0', axis=1)
print (df)
     A_x  B_x  C_x     D_x     A_y  B_y  C_y     D_y
0    Red   36    1  type-1   Brown   36    9  type-2
1   Blue   78    2  type-1  Orange   78    5  type-2
2  Green   59    3  type-1  Purple   59    7  type-2

What is problem with same columns names:

If need select column first A expected ouput is get Series.

print (df.A_x)
0      Red
1     Blue
2    Green
Name: A_x, dtype: object

But if duplicated names get all columns in DataFrame, DONT DO IT:

df = df.rename(columns=lambda x: x.split('_')[0])
# print (df)

print (df.A)
       A       A
0    Red   Brown
1   Blue  Orange
2  Green  Purple
  • Related