Home > Software design >  How to concat two dataframes with duplicate column names?
How to concat two dataframes with duplicate column names?

Time:06-23

reproducible data

import pandas as pd
import numpy as np

cols1=['b','a','c','a']
data1=[0,0,0,0]
df1=pd.DataFrame([data1], columns= cols1)
df1

cols2=['b','a', 'd', 'a', 'e','f']
data2=[1,1,1,1,1,1]
df2=pd.DataFrame([data2], columns= cols2)
df2

result I want

data = { "b": [0, 1],
        "b a" : [0, 1],
        "c" : [0, np.NaN],
        "c a" : [0, np.NaN],
       "d" : [np.NaN, 1],
       "d a" : [np.NaN, 1],
       "e" : [np.NaN, 1],
       "f" : [np.NaN, 1]}
pd.DataFrame(data)

If df have duplicate column name "a", I cannot use "concat" function.

Any good way to deal with duplicate column names?

If there is a "b" before "a", I want to change the corresponding a to "b a".

CodePudding user response:

It is not standard deduplicated columns names, this working if not consecutive duplicates like a,b,b,b or b,a,b,a columns names:

s1 = df1.columns.to_series()
df1.columns = [f'{b} {a}' if c else a for a, b, c in zip(df1.columns, s1.shift(fill_value=''), s1.duplicated(keep=False))]


s2 = df2.columns.to_series()
df2.columns = [f'{b} {a}' if c else a for a, b, c in zip(df2.columns, s2.shift(fill_value=''), s2.duplicated(keep=False))]


df = pd.concat([df1, df2])
print (df)
   b  b a    c  c a    d  d a    e    f
0  0    0  0.0  0.0  NaN  NaN  NaN  NaN
0  1    1  NaN  NaN  1.0  1.0  1.0  1.0

Obviously deduplicate columns names is done by positions, here is a duplicated, so added 1. But output is different:

s1 = df1.columns.to_series()
df1.columns = s1.str.cat(s1.groupby(s1).cumcount().astype(str), sep=' ').str.replace(' 0','', regex=True)

s2 = df2.columns.to_series()
df2.columns = s2.str.cat(s2.groupby(s2).cumcount().astype(str), sep=' ').str.replace(' 0','', regex=True)


df = pd.concat([df1, df2], ignore_index=True)
print (df)
   b  a    c  a 1    d    e    f
0  0  0  0.0    0  NaN  NaN  NaN
1  1  1  NaN    1  1.0  1.0  1.0
  • Related