I have a Pandas dataframe with 3 columns and I plan to create a 4th column that is a compilation of the values in these. The business rules stipulate that if one column is not NULL then the other columns are NULL for a given row. However, I want to insert a check.
The asserts below are sufficient for confirming the business rules. However, this is not very scalable. What would be a better way to write this such that any number of variables can be added?
import pandas as pd
import numpy as np
d = {
"var1": [np.nan, 2, 3, np.nan, np.nan],
"var2": [np.nan, np.nan, np.nan, 1, 1],
"var3": [1, np.nan, np.nan, np.nan, np.nan]
}
df = pd.DataFrame(data=d)
print(df)
var1 var2 var3
0 NaN NaN 1.0
1 2.0 NaN NaN
2 3.0 NaN NaN
3 NaN 1.0 NaN
4 NaN 1.0 NaN
# Confirm only one column has a value in each row
assert df.loc[~(df.var1.isna()) & (~(df.var2.isna()) | ~(df.var3.isna()))].empty
assert df.loc[~(df.var2.isna()) & (~(df.var1.isna()) | ~(df.var3.isna()))].empty
# This third assert isn't really needed
assert df.loc[~(df.var3.isna()) & (~(df.var2.isna()) | ~(df.var1.isna()))].empty
CodePudding user response:
If you want to check exactly one non-nan:
assert df.notna().sum(axis=1).eq(1).all()
If you want to check at most one non-nan, change eq
to le
for less than or equal to