Home > Software design >  Merge a list of dataframes by one column with reduce function
Merge a list of dataframes by one column with reduce function

Time:12-10

i have edited this post with the specific case:

i have a list of dataframes like this (note that df1 and df2 have a row in common)

df1

index Date A
0 2010-06-19 4
1 2010-06-20 3
2 2010-06-21 2
3 2010-06-22 1
4 2012-07-19 5

df2

index Date B
0 2012-07-19 5
1 2012-07-20 6

df3

index Date C
0 2020-06-19 5
1 2020-06-20 2
2 2020-06-21 9

df_list = [df1, df2, df3]

I would like to merge all dataframes in a single dataframe, without losing rows and placing nan where there are no things to merge. The criteria would be merging them by the column 'Date' (the column should have all the dates of all the merged dataframes, ordered by date).

The resulting dataframe should look like this:

Resulting Dataframe:

index Date A B C
0 2010-06-19 4 nan nan
1 2010-06-20 3 nan nan
2 2010-06-21 2 nan nan
3 2010-06-22 1 nan nan
4 2012-07-19 5 5 nan
5 2012-07-20 nan 6 nan
6 2020-06-19 nan nan 5
7 2020-06-20 nan nan 2
8 2020-06-21 nan nan 9

I tried something like this:

from functools import reduce

df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['Date'], how='outer'), df_list)

BUT the resulting dataframe is not as expected (i miss some columns and is not ordered by date). I think i am missing something.

Thank you very much

CodePudding user response:

Use pandas.concat(). It takes a list of dataframes, and appends common columns together, filling new columns with NaN as necessary:

new_df = pd.concat([df1, df2, df3])

Output:

>>> new_df
   index        Date    A    B    C
0      0  2010-06-19  4.0  NaN  NaN
1      1  2010-06-20  3.0  NaN  NaN
2      2  2010-06-21  2.0  NaN  NaN
3      3  2010-06-22  1.0  NaN  NaN
0      0  2012-07-19  NaN  5.0  NaN
1      1  2012-07-20  NaN  6.0  NaN
0      0  2020-06-19  NaN  NaN  5.0
1      1  2020-06-20  NaN  NaN  2.0
2      2  2020-06-21  NaN  NaN  9.0

CodePudding user response:

SOLVED: for overlapping datas, i had to add the option: Sort = TRUE in the lambda function. Seemed i was missing the order for big dataframes and i was only seeng the nan at end and start of frames. Thank you all ;-)

from functools import reduce

df_merged = reduce(lambda left,right: pd.merge(left,right,on=['Date'], how='outer', sort=True), df_list)

  • Related