reproducible data
import pandas as pd
import numpy as np
cols1=['b','a','c','a']
data1=[0,0,0,0]
df1=pd.DataFrame([data1], columns= cols1)
df1
cols2=['b','a', 'd', 'a', 'e','f']
data2=[1,1,1,1,1,1]
df2=pd.DataFrame([data2], columns= cols2)
df2
result I want
data = { "b": [0, 1],
"b a" : [0, 1],
"c" : [0, np.NaN],
"c a" : [0, np.NaN],
"d" : [np.NaN, 1],
"d a" : [np.NaN, 1],
"e" : [np.NaN, 1],
"f" : [np.NaN, 1]}
pd.DataFrame(data)
If df have duplicate column name "a", I cannot use "concat" function.
Any good way to deal with duplicate column names?
If there is a "b" before "a", I want to change the corresponding a to "b a".
CodePudding user response:
It is not standard deduplicated columns names, this working if not consecutive duplicates like a,b,b,b
or b,a,b,a
columns names:
s1 = df1.columns.to_series()
df1.columns = [f'{b} {a}' if c else a for a, b, c in zip(df1.columns, s1.shift(fill_value=''), s1.duplicated(keep=False))]
s2 = df2.columns.to_series()
df2.columns = [f'{b} {a}' if c else a for a, b, c in zip(df2.columns, s2.shift(fill_value=''), s2.duplicated(keep=False))]
df = pd.concat([df1, df2])
print (df)
b b a c c a d d a e f
0 0 0 0.0 0.0 NaN NaN NaN NaN
0 1 1 NaN NaN 1.0 1.0 1.0 1.0
Obviously deduplicate columns names is done by positions, here is a
duplicated, so added 1
. But output is different:
s1 = df1.columns.to_series()
df1.columns = s1.str.cat(s1.groupby(s1).cumcount().astype(str), sep=' ').str.replace(' 0','', regex=True)
s2 = df2.columns.to_series()
df2.columns = s2.str.cat(s2.groupby(s2).cumcount().astype(str), sep=' ').str.replace(' 0','', regex=True)
df = pd.concat([df1, df2], ignore_index=True)
print (df)
b a c a 1 d e f
0 0 0 0.0 0 NaN NaN NaN
1 1 1 NaN 1 1.0 1.0 1.0