Home > Software engineering >  Pandas: correct way to merge dataframes (not with pd.merge() apparently)
Pandas: correct way to merge dataframes (not with pd.merge() apparently)

Time:03-14

I read another question that probably has some similar problem (this one) but I couldn't understand the answer.

Consider 2 dataframes defined like this

df1 = pd.DataFrame({
    'A' : ['a','b','c','d'],
    'B': [1,np.nan,np.nan,4]
})
df2 = pd.DataFrame({
    'A' : ['a','c','b','d'],
    'B' : [np.nan, 8, 9, np.nan]

I want to merge them to fill blank cells. At first I used

df1.merge(df2, on='A')

but this caused my df1 to have 2 different columns named B_x and B_y. I also tried with different parameters for the .merge() method but still couldn't solve the issue.
The final dataframe should look like this one

df1 = pd.DataFrame({
    'A' : ['a','b','c','d'],
    'B': [1,9,8,4]
})

Do you know what's the most logic way to do that?
I think that pd.concat() should be a useful tool for this job but I have no idea on how to apply it.

EDIT: I modified values so that 'A' columns don't have the same order in both dataframes. This should avoid any ambiguity.

CodePudding user response:

Maybe you can use fillna instead:

new_df = df1.fillna(df2)

Output:

>>> new_df
   A    B
0  a  1.0
1  b  9.0
2  c  8.0
3  d  4.0

Here's a different solution:

df1.merge(df2, on='A', suffixes=(';',';')).pipe(lambda x: x.set_axis(x.columns.str.strip(';'), axis=1)).groupby(level=0, axis=1).first()

The semicolons (;) are arbitrary; you can use any character as long as it doesn't appear in any columns names.

CodePudding user response:

You can map the values from df2:

df1['B'] = df1['B'].fillna(df1['A'].map(df2.set_index('A')['B']))

output:

   A    B
0  a  1.0
1  b  9.0
2  c  8.0
3  d  4.0
Alternative

if the values in A are unique, you can merge only the column A of df1 and combine_first:

df1 = df1.combine_first(df1[['A']].merge(df2))
  • Related