I have the following two dataframes.
First dataframe df1:
import pandas as pd
import numpy as np
d1 = {'id': [1, 2, 3, 4], 'col1': [13, np.nan, 15, np.nan], 'col2': [23, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1
id col1 col2
0 1 13.0 23.0
1 2 NaN NaN
2 3 15.0 NaN
3 4 NaN NaN
And the second dataframe df2:
d2 = {'id': [2, 3, 4], 'col1': [ 14, 150, 16], 'col2': [24, 250, np.nan]}
df2 = pd.DataFrame(data=d2)
df2
id col1 col2
0 2 14 24.0
1 3 150 250.0
2 4 16 NaN
I need to replace the NaN fields in df1 with the non-NaN values from df2, where possible. But there are some conditions:
Condition 1) The id column in each dataframe consists of unique values. When replacing any NaN value in df1 with a value from df2, the id values need to match.
Condition 2) The dataframes do not necessarily have the same number of rows.
Condition 3) NaN values are only looked for in col1 or col2 in either dataframe. The id column is never NaN in any row. There might be other columns in the dataframes, with or without NaN values, but for replacing the data we only look at col1 and col2.
Condition 4) A row in df1 qualifies for replacement as soon as either col1 or col2 is NaN in that row. When a NaN is detected in a df1 row, the entire row is replaced by the row with the same id from df2, as long as both col1 and col2 in that df2 row are non-NaN. In other words, if the df2 row with the same id has a NaN in col1 or col2, do not replace any data in df1.
After this operation, df1 should look like the following:
id col1 col2
0 1 13.0 23.0
1 2 14 24
2 3 150.0 250.0 # Note that the entire row is replaced!
3 4 NaN NaN # This row is not replaced because col2 is NaN in df2 for the same id
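To make the conditions concrete, this is the rule spelled out as a plain loop (not what I want as a final solution; df1_fixed and df2_by_id are just helper names used for this illustration):
df1_fixed = df1.copy()
df2_by_id = df2.set_index('id')
for i in df1_fixed.index:
    row_id = df1_fixed.at[i, 'id']
    # Condition 4: the df1 row qualifies if col1 or col2 is NaN ...
    if df1_fixed.loc[i, ['col1', 'col2']].isna().any() and row_id in df2_by_id.index:
        candidate = df2_by_id.loc[row_id, ['col1', 'col2']]
        # ... and the df2 row with the same id must be fully non-NaN
        if candidate.notna().all():
            df1_fixed.loc[i, ['col1', 'col2']] = candidate.values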
How can this be done in the most elegant way? Python offers a lot of functions that I may not be aware of, which may solve this problem in a few lines instead of writing very complex logic.
CodePudding user response:
You can drop the NaN values from df2, then update with concat and groupby:
pd.concat([df2.dropna(), df1]).groupby('id', as_index=False).first()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 150.0 250.0
3 4 NaN NaN
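For readability, the same pipeline can also be written step by step (df2_complete, combined and result are just names used here):
# Keep only the df2 rows without any NaN (per the conditions, id is never NaN)
df2_complete = df2.dropna()
# Put the complete df2 rows first, so groupby(...).first() prefers their values
# over df1's values for the same id
combined = pd.concat([df2_complete, df1])
# first() takes the first non-NaN value per column within each id group,
# so ids missing from df2_complete fall back to the original df1 values
result = combined.groupby('id', as_index=False).first()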
CodePudding user response:
Here is another way, using fillna:
df1 = df1.set_index('id').fillna(df2.dropna().set_index('id')).reset_index()
Output:
id col1 col2
0 1 13.0 23.0
1 2 14.0 24.0
2 3 15.0 250.0
3 4 NaN NaN
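Note that fillna fills cell by cell, which is why col1 for id 3 stays 15.0 here instead of the whole row being taken from df2, as the question's expected output asks. If the whole-row behaviour is needed, one possible mask-based sketch (needs_fix, complete, ids and out are just names used here):
needs_fix = df1.loc[df1[['col1', 'col2']].isna().any(axis=1), 'id']   # ids of df1 rows with a NaN
complete = df2.loc[df2[['col1', 'col2']].notna().all(axis=1), 'id']   # ids of fully filled df2 rows
ids = needs_fix[needs_fix.isin(complete)].tolist()                    # ids whose rows get replaced entirely
out = df1.set_index('id')
out.loc[ids, ['col1', 'col2']] = df2.set_index('id').loc[ids, ['col1', 'col2']]
out = out.reset_index()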