How to complete NaN cells based on another Pandas dataframe in Python

Time:09-02

I have the following two dataframes.

First dataframe df1:

import pandas as pd
import numpy as np

d1 = {'id': [1, 2, 3, 4], 'col1': [13, np.nan, 15, np.nan], 'col2': [23, np.nan, np.nan, np.nan]}
df1 = pd.DataFrame(data=d1)
df1

    id  col1    col2
0   1   13.0    23.0
1   2   NaN     NaN
2   3   15.0    NaN
3   4   NaN     NaN

And the second dataframe df2:

d2 = {'id': [2, 3, 4], 'col1': [ 14, 150, 16], 'col2': [24, 250, np.nan]}
df2 = pd.DataFrame(data=d2)
df2

    id  col1    col2
0   2   14      24.0
1   3   150     250.0
2   4   16      NaN

I need to replace the NaN fields in df1 with the non-NaN values from df2, where it is possible. But there are some conditions...

Condition 1) id column in each dataframe consists of unique values. When replacing any NaN value in df1 with another value from df2, the id column value needs to match.

Condition 2) Dataframes do not necessarily have the same size.

Condition 3) NaN values will only be looked for in col1 or col2 in any of the dataframes. The id column cannot be NaN in any row. There might be other columns in the dataframes, with or without NaN values. But for replacing the data, we will only be looking at col1 and col2 columns.

Condition 4) A row in df1 qualifies for replacement as soon as either col1 or col2 in that row is NaN. When a NaN value is detected in a row of df1, the entire row is replaced by the row with the same id value from df2, as long as all values of col1 and col2 in that df2 row are non-NaN. In other words, if the row with the same id value in df2 has a NaN value in either col1 or col2, do not replace any data in df1.

After doing this operation, the df1 should look like the following:

    id  col1    col2
0   1   13.0    23.0
1   2   14      24    
2   3   150.0   250.0    # Note that the entire row is replaced!
3   4   NaN     NaN      # This row is not replaced because the col2 value is NaN in df2 for the same id

How can this be done in the most elegant way? Python offers a lot of functions that I may not be aware of, and one of them may solve this problem in a few lines instead of requiring very complex logic.

CodePudding user response:

You can drop the NaN values from df2, then update with concat and groupby:

pd.concat([df2.dropna(), df1]).groupby('id', as_index=False).first()

Output:

   id   col1   col2
0   1   13.0   23.0
1   2   14.0   24.0
2   3  150.0  250.0
3   4    NaN    NaN
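
Note that the one-liner above takes the df2 values for every id that has a complete df2 row, whether or not the matching df1 row contained a NaN, and it would also add ids that exist only in df2. If Condition 4 is read strictly, a more explicit version might look like the sketch below. It assumes that only col1 and col2 are replaced and that completeness of a df2 row is judged on those two columns alone; the names cols, left, donors and ids are only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'col1': [13, np.nan, 15, np.nan],
                    'col2': [23, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'id': [2, 3, 4],
                    'col1': [14, 150, 16],
                    'col2': [24, 250, np.nan]})

cols = ['col1', 'col2']  # assumption: only these columns are checked and replaced

left = df1.set_index('id')
# Donor rows: df2 rows where both col1 and col2 are present.
donors = df2.dropna(subset=cols).set_index('id')

# ids of df1 rows that have at least one NaN in col1/col2 ...
has_nan = left[cols].isna().any(axis=1)
# ... and for which a complete donor row exists in df2.
ids = left[has_nan].index.intersection(donors.index)

# Replace col1 and col2 together for those ids only.
left.loc[ids, cols] = donors.loc[ids, cols]
result = left.reset_index()
print(result)
```

With the sample data this yields the same table as above, but a df1 row that is already complete would be left untouched even if df2 had a complete row for the same id.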

CodePudding user response:

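Here is another way, using fillna. Note that fillna only fills the missing cells, so the row with id 3 keeps its original col1 value (15.0) instead of being replaced as a whole: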

df1 = df1.set_index('id').fillna(df2.dropna().set_index('id')).reset_index()

Output:

   id  col1   col2
0   1  13.0   23.0
1   2  14.0   24.0
2   3  15.0  250.0
3   4   NaN    NaN
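
If the whole row should be taken from df2, as the question asks, one possible variation of the index-based approach is DataFrame.update, which overwrites cells of the caller with the non-NaN values of the index-aligned argument. This is only a sketch under the assumption that completeness is judged on col1 and col2; like the concat answer, it overwrites every id that has a complete df2 row, including df1 rows that had no NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'col1': [13, np.nan, 15, np.nan],
                    'col2': [23, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'id': [2, 3, 4],
                    'col1': [14, 150, 16],
                    'col2': [24, 250, np.nan]})

cols = ['col1', 'col2']  # assumption: only these columns matter

left = df1.set_index('id')
# Keep only df2 rows where both columns are present.
donors = df2.dropna(subset=cols).set_index('id')

# update() modifies `left` in place and returns None.
# Rows are matched on the index (id); ids missing from donors are untouched.
left.update(donors[cols])
result = left.reset_index()
print(result)
```

Here the row with id 3 ends up with col1 = 150.0 and col2 = 250.0, while the row with id 4 stays NaN because its df2 row was dropped.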