Concise approach to check that only one column is populated at the row level


I have a Pandas DataFrame with three columns and plan to create a fourth column that coalesces their values. The business rules stipulate that, in any given row, if one column is non-NULL then the other columns are NULL. I want to insert a check for this.

The asserts below are sufficient to confirm the business rules, but they do not scale. What would be a better way to write this so that any number of columns can be checked?

import pandas as pd
import numpy as np

d = {
     "var1": [np.nan, 2, 3, np.nan, np.nan],
     "var2": [np.nan, np.nan, np.nan, 1, 1],
     "var3": [1, np.nan, np.nan, np.nan, np.nan]
     }
df = pd.DataFrame(data=d)

print(df)

   var1  var2  var3
0   NaN   NaN   1.0
1   2.0   NaN   NaN
2   3.0   NaN   NaN
3   NaN   1.0   NaN
4   NaN   1.0   NaN

# Confirm only one column has a value in each row
assert df.loc[~(df.var1.isna()) & (~(df.var2.isna()) | ~(df.var3.isna()))].empty
assert df.loc[~(df.var2.isna()) & (~(df.var1.isna()) | ~(df.var3.isna()))].empty
# This third assert isn't really needed
assert df.loc[~(df.var3.isna()) & (~(df.var2.isna()) | ~(df.var1.isna()))].empty

CodePudding user response:

If you want to check that each row has exactly one non-NaN value:

assert df.notna().sum(axis=1).eq(1).all()

If you want to check for at most one non-NaN (i.e. all-NaN rows are also allowed), change `eq` to `le` (less than or equal).
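To illustrate, here is a short sketch using the DataFrame from the question. It shows both check variants, plus one way to build the fourth column the asker mentions: since at most one value per row is non-NaN, a row-wise back-fill followed by taking the first column yields the coalesced value (the column name `combined` is just an example):

```python
import numpy as np
import pandas as pd

d = {
    "var1": [np.nan, 2, 3, np.nan, np.nan],
    "var2": [np.nan, np.nan, np.nan, 1, 1],
    "var3": [1, np.nan, np.nan, np.nan, np.nan],
}
df = pd.DataFrame(data=d)

# Exactly one non-NaN per row
assert df.notna().sum(axis=1).eq(1).all()

# At most one non-NaN per row (would also accept all-NaN rows)
assert df.notna().sum(axis=1).le(1).all()

# Coalesce: back-fill across columns so the first column holds
# the row's single non-NaN value, then take that column.
df["combined"] = df.bfill(axis=1).iloc[:, 0]
print(df["combined"].tolist())  # [1.0, 2.0, 3.0, 1.0, 1.0]
```

Both checks work unchanged for any number of columns, since `notna().sum(axis=1)` counts non-NaN values across the whole row.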
