concat several pandas dataframes

Time:06-29

df1 = pd.DataFrame({'a':['id1','id2','id3'],'b':['W','W','W'],'c1':[1,2,3]})
df2 = pd.DataFrame({'a':['id1','id2','id3'],'b':['W','W','W'],'c2':[4,5,6]})
df3 = pd.DataFrame({'a':['id1','id4','id5'],'b':['Q','Q','Q'],'c1':[7,8,9]})

I'm trying to concatenate df1, df2, and df3 into one dataframe:

a    b   c1   c2
id1  W   1    4
id2  W   2    5
id3  W   3    6
id1  Q   7    NA
id4  Q   8    NA
id5  Q   9    NA

I tried:

l = [d.set_index(['a','b']) for d in [df1,df2,df3]]
pd.concat(l, axis=1)

but the output isn't what I expected:

        c1   c2   c1
a   b               
id1 W  1.0  4.0  NaN
id2 W  2.0  5.0  NaN
id3 W  3.0  6.0  NaN
id1 Q  NaN  NaN  7.0
id4 Q  NaN  NaN  8.0
id5 Q  NaN  NaN  9.0

CodePudding user response:

You can concatenate the MultiIndex Series created by DataFrame.stack, then unstack the result:

l = [d.set_index(['a','b']).stack() for d in [df1,df2,df3]]
df = pd.concat(l).unstack().sort_index(level=[1,0], ascending=[False, True])
print (df)
        c1   c2
a   b          
id1 W  1.0  4.0
id2 W  2.0  5.0
id3 W  3.0  6.0
id1 Q  7.0  NaN
id4 Q  8.0  NaN
id5 Q  9.0  NaN
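
A minimal sketch of why this works, assuming the three frames from the question: after stack(), every value is keyed by (a, b, original column name), so concatenating along the rows never produces two columns called c1. The uniqueness check is an addition for illustration, because unstack() fails on duplicate keys.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# Each stacked Series is indexed by (a, b, column name).
stacked = pd.concat([d.set_index(['a', 'b']).stack() for d in [df1, df2, df3]])

# unstack() cannot pivot duplicate keys, so verify before reshaping.
if not stacked.index.is_unique:
    raise ValueError("the same (a, b, column) key appears in more than one frame")

# Pivot the innermost level (the column names) back into columns.
wide = stacked.unstack()
```

If two frames carried a value for the same (a, b, column) key, the check would raise, and something like combine_first would be needed to decide which value wins.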

If each DataFrame has only 3 columns (the two key columns plus one value column), use DataFrame.squeeze, or select the first column with iloc[:, 0], to build a list of Series:

l = [d.set_index(['a','b']).squeeze() for d in [df1,df2,df3]]
keys = [x.name for x in l]
df = (pd.concat(l, axis=0, keys=keys)
        .unstack(0)
        .sort_index(level=[1,0], ascending=[False, True]))
print (df)
        c1   c2
a   b          
id1 W  1.0  4.0
id2 W  2.0  5.0
id3 W  3.0  6.0
id1 Q  7.0  NaN
id4 Q  8.0  NaN
id5 Q  9.0  NaN

l = [d.set_index(['a','b']).iloc[:, 0] for d in [df1,df2,df3]]
keys = [x.name for x in l]
df = (pd.concat(l, axis=0, keys=keys)
        .unstack(0)
        .sort_index(level=[1,0], ascending=[False, True]))

Another idea is to chain the DataFrames in the list with DataFrame.combine_first:

from functools import reduce

dfs = [d.set_index(['a','b']) for d in [df1,df2,df3]]
df = (reduce(lambda x, y: x.combine_first(y), dfs)
        .sort_index(level=[1,0], ascending=[False, True]))
print (df)
        c1   c2
a   b          
id1 W  1.0  4.0
id2 W  2.0  5.0
id3 W  3.0  6.0
id1 Q  7.0  NaN
id4 Q  8.0  NaN
id5 Q  9.0  NaN
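
One thing to keep in mind with combine_first is that the frame on the left keeps its own non-null values, so the order of the list matters once frames overlap. A small sketch with two hypothetical frames (left and right are illustrative names, not from the question) that disagree on the value for id1:

```python
from functools import reduce

import pandas as pd

# Hypothetical frames that both define c1 for the key ('id1', 'W').
left = pd.DataFrame({'a': ['id1'], 'b': ['W'], 'c1': [1]}).set_index(['a', 'b'])
right = pd.DataFrame({'a': ['id1', 'id2'], 'b': ['W', 'W'],
                      'c1': [100, 2]}).set_index(['a', 'b'])

# The first frame in the list wins wherever both have a value.
left_first = reduce(lambda x, y: x.combine_first(y), [left, right])
right_first = reduce(lambda x, y: x.combine_first(y), [right, left])
```

In the question's data no key and column pair appears twice, so the order makes no difference there.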

CodePudding user response:

First, merge df1 and df2 on columns a and b:

df_try_1 = df1.merge(df2, on=["a","b"])

Then concatenate the result with df3:

df_try_2 = pd.concat([df_try_1, df3], axis=0)
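
The two steps can be run end to end as in the sketch below, assuming the frames from the question. The ignore_index=True argument is an addition that is not in the answer; it only renumbers the rows so the index does not repeat 0, 1, 2.

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# Inner join on the key columns puts c1 and c2 side by side.
merged = df1.merge(df2, on=['a', 'b'])

# Stack df3 underneath; it has no c2 column, so c2 is NaN for its rows.
result = pd.concat([merged, df3], axis=0, ignore_index=True)
print(result)
```

This relies on df1 and df2 sharing every (a, b) key; with the default inner join, keys present in only one of them would be dropped.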


CodePudding user response:

Try:

a = df1.merge(df2[['a','c2']], on='a', how='left')
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
l = pd.concat([a, df3])

CodePudding user response:

This should work in your case:

df = pd.merge(pd.merge(df1, df2, how='outer', on=['a', 'b']), df3, how='outer', on=['a', 'b'])
df.set_index(['a', 'b'], inplace=True)
df.columns = ['c1', 'c2', 'c3']
print(df)

Result:

        c1   c2   c3
a   b               
id1 W  1.0  4.0  NaN
id2 W  2.0  5.0  NaN
id3 W  3.0  6.0  NaN
id1 Q  NaN  NaN  7.0
id4 Q  NaN  NaN  8.0
id5 Q  NaN  NaN  9.0
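
This keeps the two c1 columns apart (pandas suffixes them as c1_x and c1_y before the rename). If a single c1 column is wanted, as in the question, one possible follow-up is to fill the gaps in one from the other. This is a sketch of that idea, not part of the answer above:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# Same two outer merges; the overlapping c1 columns become c1_x and c1_y.
merged = (df1.merge(df2, how='outer', on=['a', 'b'])
             .merge(df3, how='outer', on=['a', 'b']))

# Coalesce: take c1_x where present, otherwise fall back to c1_y.
merged['c1'] = merged['c1_x'].fillna(merged['c1_y'])
result = merged[['a', 'b', 'c1', 'c2']]
```

The row order of an outer merge can differ from the order in the question, so sort afterwards if the order matters.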

CodePudding user response:

I think merge is your best shot for that.

df = df1.combine_first(df2)
pd.merge(df, df3, on=['a', 'b', 'c1'], how='outer')

This results in the expected output:

     a  b  c1   c2
0  id1  W   1  4.0
1  id2  W   2  5.0
2  id3  W   3  6.0
3  id1  Q   7  NaN
4  id4  Q   8  NaN
5  id5  Q   9  NaN
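
Note that df1.combine_first(df2) lines rows up by their index (0, 1, 2 here), not by the values in a and b, so it only works because both frames list the ids in the same order. A hedged sketch of a key-aligned variant; df2_shuffled is a hypothetical reordering used only to show the difference:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c1': [1, 2, 3]})
df2 = pd.DataFrame({'a': ['id1', 'id2', 'id3'], 'b': ['W', 'W', 'W'], 'c2': [4, 5, 6]})
df3 = pd.DataFrame({'a': ['id1', 'id4', 'id5'], 'b': ['Q', 'Q', 'Q'], 'c1': [7, 8, 9]})

# Hypothetical: the same rows as df2, but in reverse order.
df2_shuffled = df2.iloc[::-1].reset_index(drop=True)

# Positional alignment pairs id1 with the c2 value that belongs to id3.
by_position = df1.combine_first(df2_shuffled)

# Aligning on the key columns gives the intended pairing regardless of row order.
keys = ['a', 'b']
by_key = (df1.set_index(keys)
             .combine_first(df2_shuffled.set_index(keys))
             .reset_index())

result = pd.merge(by_key, df3, on=['a', 'b', 'c1'], how='outer')
```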