Panda's evaluating NULL to NULL as false when comparing columns-CodePudding

I am creating a python program to do adhoc comparisons of different source files in a OLD -> NEW style compare. Currently I am merging the DF's and using np.where to evaluate differences, the issue I am facing is when the data in both comparison columns is null it evaluates as a difference rather than no difference. The existence of a difference is calculated in a new column labeled "diff".

DF1

CODE	EFF_DT
XYZ	1/1/2022
ABC	1/1/2022
123

DF2

CODE	EFF_DT
XYZ	4/1/2022
ABC	1/1/2022
123

ColNames= ['CODE', 'EFF_DT']

df1 = pd.read_excel(sourcefile1.xlsx, header=1, names=ColNames)
df2 = pd.read_excel(sourcefile2.xlsx, header=1, names=ColNames)

dfResults = pd.merge(df1, df2, how='outer', left_on='code', 
                     right_on='code', suffixes=('_old','_new')

#remove key from list
ColNames.remove('CODE')

#do compare and calculate diffs, done this way to evaluate N number cols

for z in names:
    dfResults[z,'diff'] = np.where(dfResults[z   '_old'] ==
                                   dfResults[z   '_new'],  '', 'Diff')

#filter out rows with no diff
dfResults  = dfResults[dfResults.isin(['Diff']).any(axis=1)]

print(dfResults)

Actual Result:

CODE	EFF_DT_old	EFF_DT_new	(EFF_DT), 'diff'
XYZ	1/1/2022	4/1/2022	Diff
123			Diff

Expected Result:

CODE	EFF_DT_old	EFF_DT_new	(EFF_DT), 'diff'
XYZ	1/1/2022	4/1/2022	Diff

How do I make the np.where condition evaluate null == null as TRUE instead of FALSE?

CodePudding user response：

Why do you make a string and not a boolean out of your diff column? Besides why do you loop over names?

I think what you're trying to achieve can be done without where and dropping rows with None/NULL/NaN:

dfResults = pd.merge(df1, df2, how='outer', left_on='CODE', 
                     right_on='CODE', suffixes=('_old','_new'))

dfResults['diff'] = dfResults['EFF_DT_old'] !=  dfResults['EFF_DT_new']
print(dfResults[dfResults['diff']].dropna())

Output:

  CODE EFF_DT_old EFF_DT_new  diff
0  XYZ   1/1/2022   4/1/2022  True

Does this address your problem?

CodePudding user response：

By filling the Null column with a value I was able to produce the expected result.

ColNames= ['CODE', 'EFF_DT']

df1 = pd.read_excel(sourcefile1.xlsx, header=1, names=ColNames)
df2 = pd.read_excel(sourcefile2.xlsx, header=1, names=ColNames)

for i in ColNames:
    df1[i].fillna("Nan",inplace=True)
    df2[i].fillna("Nan",inplace=True)

dfResults = pd.merge(df1, df2, how='outer', left_on='code', 
                     right_on='code', suffixes=('_old','_new')

#remove key from list
ColNames.remove('CODE')

#do compare and calculate diffs, done this way to evaluate N number cols

for z in names:
    dfResults[z,'diff'] = np.where(dfResults[z   '_old'] ==
                                   dfResults[z   '_new'],  '', 'Diff')

#filter out rows with no diff
dfResults  = dfResults[dfResults.isin(['Diff']).any(axis=1)]

print(dfResults)

CODE	EFF_DT_old	EFF_DT_new	(EFF_DT), 'diff'
XYZ	1/1/2022	4/1/2022	Diff