I have a list of pandas dataframes as below and would like to concat/merge them so that the values in the shared column are exhaustive between all of the dataframes. What is the best approach?
DF 1:
Col1 | Col2 |
---|---|
BLAH1 | A |
BLAH2 | Z |
DF 2:
Col1 | Col2 | Col3 |
---|---|---|
BLAH2 | Z | B |
BLAH3 | Q | K |
DF 3:
Col1 | Col4 |
---|---|
BLAH2 | C |
BLAH3 | W |
Desired Outcome
Col1 | Col2 | Col3 | Col4 |
---|---|---|---|
BLAH1 | A | NaN | NaN |
BLAH2 | Z | B | C |
BLAH3 | Q | K | W |
CodePudding user response:
We can use reduce
and merge
like so :
from functools import reduce
reduce(lambda left, right: pd.merge(left, right, on='Col1'), [df1, df2, df3])
Here the reduce
apply function of two arguments cumulatively to the items of iterable, from left to right, so as to reduce the iterable to a single value.
The left argument, left
, is the accumulated value and the right argument, right
, is the update value from the iterable.
CodePudding user response:
If the keys are unique, within each dataframe, you can do a concat then groupby:
list_dfs = [df1, df2, df3]
pd.concat(list_dfs).groupby('Col1').first()
In general, you can combine a reduce
and merge
:
from functools import reduce
# find common columns
commons = reduce(lambda x,y: set(x).intersection(set(y)), list_dfs)
reduce(lambda x,y: x.merge(y, on=commons), list_dfs)
CodePudding user response:
This article explains all the ins and outs of merge, join concatination and append better than I ever could. I would suggest reading this: [1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
CodePudding user response:
Let's use functools.reduce
with pd.DataFrame.combine_first
and comprehension:
from functools import reduce
reduce(lambda x, y: x.combine_first(y),
(df.set_index('Col1') for df in [df1,df2,df3])).reset_index()
Output:
Col1 Col2 Col3 Col4
0 BLAH1 A NaN NaN
1 BLAH2 Z B C
2 BLAH3 Q K W
Given input dataframes as:
df1 = pd.DataFrame({'Col1':['BLAH1', 'BLAH2'],
'Col2':[*'AZ']})
df2 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
'Col2':[*'ZQ'],
'Col3':[*'BK']})
df3 = pd.DataFrame({'Col1':['BLAH2', 'BLAH3'],
'Col4':[*'CW']})