Concatenating two dataframes with one common column and replacing the values of the column with the-CodePudding

I have two data sets. I need to concatenate them and replace the values of one column in the dataframe with the values of another column in the other dataframe.

Example data:

df1

date        col1   col2   col3
2020-03-27  NaN    0      NaN
2020-03-28  1      NaN    NaN
2020-03-29  2      4      2
2020-03-30  3      3      NaN
2020-03-31  3      2      NaN
2020-04-01  NaN    2      NaN

df2

date        col3
2020-03-27  NaN
2020-03-28  1
2020-03-29  2
2020-03-30  3
2020-03-31  NaN

As a result of the union, I want to get the following result:

df_res

date        col1   col2   col3
2020-03-27  NaN    0      NaN
2020-03-28  1      NaN    1
2020-03-29  2      4      2
2020-03-30  3      3      3
2020-03-31  3      2      NaN
2020-04-01  NaN    2      NaN

I tried pandas update but it didn't work for me.

Worked best: df_res = pd.merge(df1, df2, how='left', on='date'), but it creates stores two versions of the columns:

df_res

date        col1   col2   col3_x   col3_y
2020-03-27  NaN    0      NaN      NaN
2020-03-28  1      NaN    NaN      1
2020-03-29  2      4      2        2
2020-03-30  3      3      NaN      3
2020-03-31  3      2      NaN      NaN
2020-04-01  NaN    2      NaN      NaN

How can this merger be done?

CodePudding user response：

You can use pd.DataFrame.combine_first

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame.

Doing:

df1 = df1.set_index('date')
df2 = df2.set_index('date')

df = df1.combine_first(df2)
print(df)

Output:

            col1  col2  col3
date
2020-03-27   NaN   0.0   NaN
2020-03-28   1.0   NaN   1.0
2020-03-29   2.0   4.0   2.0
2020-03-30   3.0   3.0   3.0
2020-03-31   3.0   2.0   NaN
2020-04-01   NaN   2.0   NaN

CodePudding user response：

After merging in df_res, you can do this maybe;

df_res["col3"] = df_res.apply(lambda x: x["col3_x"] if pd.isna(x["col3_y"]) else x["col3_x"],axis=1)

CodePudding user response：

Another possible solution, which is based on the idea of first using date as index in both dataframes and then assigning col3 of df2 to col3 of df1:

out = df1.set_index('date')
df2 = df2.set_index('date')

out['col3'] = df2['col3']
out = out.reset_index()

Output:

         date  col1  col2  col3
0  2020-03-27   NaN   0.0   NaN
1  2020-03-28   1.0   NaN   1.0
2  2020-03-29   2.0   4.0   2.0
3  2020-03-30   3.0   3.0   3.0
4  2020-03-31   3.0   2.0   NaN
5  2020-04-01   NaN   2.0   NaN