Best way to concat/merge list of pandas dataframes into one while preserving all values in a given c-CodePudding

I have a list of pandas dataframes as below and would like to concat/merge them so that the values in the shared column are exhaustive between all of the dataframes. What is the best approach?

DF 1:

Col1	Col2
BLAH1	A
BLAH2	Z

DF 2:

Col1	Col2	Col3
BLAH2	Z	B
BLAH3	Q	K

DF 3:

Col1	Col4
BLAH2	C
BLAH3	W

Desired Outcome

Col1	Col2	Col3	Col4
BLAH1	A	NaN	NaN
BLAH2	Z	B	C
BLAH3	Q	K	W

CodePudding user response：

We can use reduce and merge like so :

from functools import reduce

reduce(lambda left, right: pd.merge(left, right, on='Col1'), [df1, df2, df3])

Here the reduce apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left, is the accumulated value and the right argument, right, is the update value from the iterable.

CodePudding user response：

If the keys are unique, within each dataframe, you can do a concat then groupby:

list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()

In general, you can combine a reduce and merge:

from functools import reduce

# find common columns
commons = reduce(lambda x,y: set(x).intersection(set(y)), list_dfs)

reduce(lambda x,y: x.merge(y, on=commons), list_dfs)

CodePudding user response：

This article explains all the ins and outs of merge, join concatination and append better than I ever could. I would suggest reading this: [1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

CodePudding user response：

Let's use functools.reduce with pd.DataFrame.combine_first and comprehension:

from functools import reduce
reduce(lambda x, y: x.combine_first(y), 
       (df.set_index('Col1') for df in [df1,df2,df3])).reset_index()

Output:

    Col1 Col2 Col3 Col4
0  BLAH1    A  NaN  NaN
1  BLAH2    Z    B    C
2  BLAH3    Q    K    W

Given input dataframes as:

df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
                   'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                   'Col2':[*'ZQ'],
                   'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                    'Col4':[*'CW']})