I am creating a python program to do adhoc comparisons of different source files in a OLD -> NEW style compare. Currently I am merging the DF's and using np.where to evaluate differences, the issue I am facing is when the data in both comparison columns is null it evaluates as a difference rather than no difference. The existence of a difference is calculated in a new column labeled "diff".
DF1
CODE | EFF_DT |
---|---|
XYZ | 1/1/2022 |
ABC | 1/1/2022 |
123 |
DF2
CODE | EFF_DT |
---|---|
XYZ | 4/1/2022 |
ABC | 1/1/2022 |
123 |
ColNames= ['CODE', 'EFF_DT']
df1 = pd.read_excel(sourcefile1.xlsx, header=1, names=ColNames)
df2 = pd.read_excel(sourcefile2.xlsx, header=1, names=ColNames)
dfResults = pd.merge(df1, df2, how='outer', left_on='code',
right_on='code', suffixes=('_old','_new')
#remove key from list
ColNames.remove('CODE')
#do compare and calculate diffs, done this way to evaluate N number cols
for z in names:
dfResults[z,'diff'] = np.where(dfResults[z '_old'] ==
dfResults[z '_new'], '', 'Diff')
#filter out rows with no diff
dfResults = dfResults[dfResults.isin(['Diff']).any(axis=1)]
print(dfResults)
Actual Result:
CODE | EFF_DT_old | EFF_DT_new | (EFF_DT), 'diff' |
---|---|---|---|
XYZ | 1/1/2022 | 4/1/2022 | Diff |
123 | Diff |
Expected Result:
CODE | EFF_DT_old | EFF_DT_new | (EFF_DT), 'diff' |
---|---|---|---|
XYZ | 1/1/2022 | 4/1/2022 | Diff |
How do I make the np.where condition evaluate null == null as TRUE instead of FALSE?
CodePudding user response:
Why do you make a string and not a boolean out of your diff
column? Besides why do you loop over names
?
I think what you're trying to achieve can be done without where
and dropping rows with None
/NULL
/NaN
:
dfResults = pd.merge(df1, df2, how='outer', left_on='CODE',
right_on='CODE', suffixes=('_old','_new'))
dfResults['diff'] = dfResults['EFF_DT_old'] != dfResults['EFF_DT_new']
print(dfResults[dfResults['diff']].dropna())
Output:
CODE EFF_DT_old EFF_DT_new diff
0 XYZ 1/1/2022 4/1/2022 True
Does this address your problem?
CodePudding user response:
By filling the Null column with a value I was able to produce the expected result.
ColNames= ['CODE', 'EFF_DT']
df1 = pd.read_excel(sourcefile1.xlsx, header=1, names=ColNames)
df2 = pd.read_excel(sourcefile2.xlsx, header=1, names=ColNames)
for i in ColNames:
df1[i].fillna("Nan",inplace=True)
df2[i].fillna("Nan",inplace=True)
dfResults = pd.merge(df1, df2, how='outer', left_on='code',
right_on='code', suffixes=('_old','_new')
#remove key from list
ColNames.remove('CODE')
#do compare and calculate diffs, done this way to evaluate N number cols
for z in names:
dfResults[z,'diff'] = np.where(dfResults[z '_old'] ==
dfResults[z '_new'], '', 'Diff')
#filter out rows with no diff
dfResults = dfResults[dfResults.isin(['Diff']).any(axis=1)]
print(dfResults)
CODE | EFF_DT_old | EFF_DT_new | (EFF_DT), 'diff' |
---|---|---|---|
XYZ | 1/1/2022 | 4/1/2022 | Diff |