Comparing 2 dataframes

Time:07-06

I have 2 dataframes that each contain 3 account indicators per account number. The account numbers match like for like in the column labelled "account". I am trying to modify dataframe 2 so that it has NaN values in the same positions as dataframe 1, column by column.

Dataframe 1:

import pandas as pd
import numpy as np
df = pd.DataFrame([[1234567890,1,np.nan,'G'],
                    [7854567890,np.nan,100,np.nan],
                    [7854567899,np.nan,np.nan,np.nan],
                    [7854567893,np.nan,100,np.nan],
                    [7854567893,np.nan,np.nan,np.nan],      
                    [9632587415,np.nan,np.nan,'B']],
columns = ['account','ind_1','ind_2','ind_3'])
df

Output:

    account     ind_1   ind_2   ind_3
0   1234567890  1.0     NaN     G
1   7854567890  NaN     100.0   NaN
2   7854567899  NaN     NaN     NaN
3   7854567893  NaN     100.0   NaN
4   7854567893  NaN     NaN     NaN
5   9632587415  NaN     NaN     B

Dataframe 2:

df2 = pd.DataFrame([[1234567890,5,np.nan,'GG'],
                    [7854567890,1,106,np.nan],
                    [7854567899,np.nan,100,'N'],
                    [7854567893,np.nan,100,np.nan],
                    [7854567893,np.nan,np.nan,np.nan],      
                    [9632587415,3,np.nan,'B']],
columns = ['account','ind_1','ind_2','ind_3'])
df2

Output:

    account     ind_1   ind_2   ind_3
0   1234567890  5.0     NaN     GG
1   7854567890  1.0     106.0   NaN
2   7854567899  NaN     100.0   N
3   7854567893  NaN     100.0   NaN
4   7854567893  NaN     NaN     NaN
5   9632587415  3.0     NaN     B

Problem: I need to change dataframe 2 so that it has NaN values in the same places as dataframe 1.

For example: column ind_1 has values at index 0, 1 and 5 in df2, whereas in df (dataframe 1) it only has a value at index 0. I need to replace the values at index 1 and 5 in df2 with NaN so that its NaN positions match those in df. The same logic applies to the other 2 columns.

Expected outcome:

    account     ind_1   ind_2   ind_3
0   1234567890  5.0     NaN     GG
1   7854567890  NaN     106.0   NaN
2   7854567899  NaN     NaN     NaN
3   7854567893  NaN     100.0   NaN
4   7854567893  NaN     NaN     NaN
5   9632587415  NaN     NaN     B  

Is there any easy way to achieve this?

Thanks in advance, Alan

CodePudding user response:

Try this

df2[~df.isna()]

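df.isna() checks where df has NaN values and creates a Boolean mask. Indexing df2 with the inverted mask (~) keeps the df2 values where df is not NaN and puts NaN everywhere else. This returns a new DataFrame and leaves df2 unchanged, so assign the result back (df2 = df2[~df.isna()]) if you want to keep it.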
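As a minimal sketch of how that mask behaves, using the data from the question: Boolean-DataFrame indexing is the same operation as DataFrame.where, and it does not modify df2. The names keep and masked are only for illustration.

```python
import numpy as np
import pandas as pd

cols = ["account", "ind_1", "ind_2", "ind_3"]
df = pd.DataFrame([[1234567890, 1, np.nan, "G"],
                   [7854567890, np.nan, 100, np.nan],
                   [7854567899, np.nan, np.nan, np.nan],
                   [7854567893, np.nan, 100, np.nan],
                   [7854567893, np.nan, np.nan, np.nan],
                   [9632587415, np.nan, np.nan, "B"]], columns=cols)
df2 = pd.DataFrame([[1234567890, 5, np.nan, "GG"],
                    [7854567890, 1, 106, np.nan],
                    [7854567899, np.nan, 100, "N"],
                    [7854567893, np.nan, 100, np.nan],
                    [7854567893, np.nan, np.nan, np.nan],
                    [9632587415, 3, np.nan, "B"]], columns=cols)

# True where df holds a value; this is the same mask as ~df.isna().
keep = df.notna()

# Indexing with a Boolean DataFrame keeps cells where the mask is True
# and fills the rest with NaN. It builds a new DataFrame.
masked = df2[keep]

# The explicit spelling of the same operation.
same = df2.where(keep)

print(masked)
```

Because the result is a new object, df2 still holds its original values until the result is assigned back to it.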

CodePudding user response:

df.isna() returns a DataFrame with True wherever there are NaN values in your first dataframe. You can use this as a Boolean mask to set the matching cells to NaN in your second DataFrame:

df2[df.isna()] = np.nan

Be careful: I assume the NaNs also need to line up with rows that have the same account value. This solution does not ensure that. It matches cells purely by row index and column label, so it assumes the values in the account column are in exactly the same order in both dataframes.
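If the row order might differ, one possible approach is to match rows on the account value before masking. The sketch below is an assumption-laden illustration, not part of the original answer: mask_like is a hypothetical helper name, and because account 7854567893 appears twice in the question's data, it pairs repeated accounts in order of appearance (first with first, second with second), which may or may not be what you want.

```python
import numpy as np
import pandas as pd

cols = ["account", "ind_1", "ind_2", "ind_3"]
df = pd.DataFrame([[1234567890, 1, np.nan, "G"],
                   [7854567890, np.nan, 100, np.nan],
                   [7854567899, np.nan, np.nan, np.nan],
                   [7854567893, np.nan, 100, np.nan],
                   [7854567893, np.nan, np.nan, np.nan],
                   [9632587415, np.nan, np.nan, "B"]], columns=cols)
df2 = pd.DataFrame([[1234567890, 5, np.nan, "GG"],
                    [7854567890, 1, 106, np.nan],
                    [7854567899, np.nan, 100, "N"],
                    [7854567893, np.nan, 100, np.nan],
                    [7854567893, np.nan, np.nan, np.nan],
                    [9632587415, 3, np.nan, "B"]], columns=cols)


def mask_like(reference, target, key="account"):
    """Hypothetical helper: copy of target with NaN wherever reference has NaN,
    with rows matched on `key` instead of on row position."""
    # Number repeated keys 0, 1, ... so duplicates pair up in order of appearance.
    ref = reference.set_index([reference[key], reference.groupby(key).cumcount()])
    tgt = target.set_index([target[key], target.groupby(key).cumcount()])
    # Reorder the reference NaN mask to follow target's rows.
    # Keys missing from reference get False, so those rows stay untouched.
    na_mask = ref.isna().reindex(index=tgt.index, fill_value=False)
    result = tgt.mask(na_mask)
    # Restore target's original row labels.
    result.index = target.index
    return result


# Same rows as df2, in a different order.
shuffled = df2.iloc[[5, 0, 3, 1, 4, 2]].reset_index(drop=True)
result = mask_like(df, shuffled)
print(result)
```

With the rows of df2 reordered, plain df2[~df.isna()] would blank the wrong cells, whereas this version looks up each row's mask by account first.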
