Home > database >  Merging multiple data frames causing duplicate column names
Merging multiple data frames causing duplicate column names

Time:12-30

This thread here suggested to use reduce to merge multiple data frames at once.

df1= pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
df2= pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
df3= pd.DataFrame({'key': ['A', 'C', 'E', 'F'], 'value': np.random.randn(4)})
df4= pd.DataFrame({'key': ['A', 'B', 'C', 'F'], 'value': np.random.randn(4)})

df_list = [df1, df2, df3, df4]

from functools import reduce
df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['key'], how='outer'), df_list )

The merge is executed, but there is a warning

<stdin>:1: FutureWarning: Passing 'suffixes' which cause duplicate columns {'value_x'} in the result is deprecated and will raise a MergeError in a future version.

and df_meged has columns with exactly duplicated names. How do I force distinct names for each column?

CodePudding user response:

You can do

s = pd.concat([x.set_index('key') for x in df_list],axis = 1,keys=range(len(df_list)))
s.columns = s.columns.map('{0[1]}_{0[0]}'.format)
s = s.reset_index()
s
Out[236]: 
  key   value_0   value_1   value_2   value_3
0   A -1.957968       NaN -0.852135 -0.976960
1   B  1.545932 -0.276838       NaN  0.197615
2   C -2.149727       NaN -0.364382  0.349993
3   D  0.524990 -0.476655       NaN       NaN
4   E       NaN -2.135870  0.798782       NaN
5   F       NaN  1.456544 -0.255705  0.447279

CodePudding user response:

An outer join in this scenario is tantamount to appending on the zero axis. You therefore are better of using the two columns in the dfs as the join keys. code below

from functools import reduce
df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['key', 'value'], how='outer'), df_list )
  • Related