Home > Software design >  Best way to concat/merge list of pandas dataframes into one while preserving all values in a given c
Best way to concat/merge list of pandas dataframes into one while preserving all values in a given c

Time:12-14

I have a list of pandas dataframes as below and would like to concat/merge them so that the values in the shared column are exhaustive between all of the dataframes. What is the best approach?

DF 1:

Col1 Col2
BLAH1 A
BLAH2 Z

DF 2:

Col1 Col2 Col3
BLAH2 Z B
BLAH3 Q K

DF 3:

Col1 Col4
BLAH2 C
BLAH3 W

Desired Outcome

Col1 Col2 Col3 Col4
BLAH1 A NaN NaN
BLAH2 Z B C
BLAH3 Q K W

CodePudding user response:

We can use reduce and merge like so :

from functools import reduce

reduce(lambda left, right: pd.merge(left, right, on='Col1'), [df1, df2, df3])

Here the reduce apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left, is the accumulated value and the right argument, right, is the update value from the iterable.

CodePudding user response:

If the keys are unique, within each dataframe, you can do a concat then groupby:

list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()

In general, you can combine a reduce and merge:

from functools import reduce

# find common columns
commons = reduce(lambda x,y: set(x).intersection(set(y)), list_dfs)

reduce(lambda x,y: x.merge(y, on=commons), list_dfs)

CodePudding user response:

This article explains all the ins and outs of merge, join concatination and append better than I ever could. I would suggest reading this: [1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

CodePudding user response:

Let's use functools.reduce with pd.DataFrame.combine_first and comprehension:

from functools import reduce
reduce(lambda x, y: x.combine_first(y), 
       (df.set_index('Col1') for df in [df1,df2,df3])).reset_index()

Output:

    Col1 Col2 Col3 Col4
0  BLAH1    A  NaN  NaN
1  BLAH2    Z    B    C
2  BLAH3    Q    K    W

Given input dataframes as:

df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
                   'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                   'Col2':[*'ZQ'],
                   'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
                    'Col4':[*'CW']})
  • Related