Obtain number of rows where one column has a value and the others don't


I have a DataFrame with 48 columns and 4000 rows. For every possible combination of columns, I want to know the number of rows (or the percentage) where exactly those columns have a value and all the others are NaN.

For example:

in 20% of total rows, varA is filled and all others are NaN

in 30% of total rows, varA and varB are filled and all the others are NaN

Can someone help me with the Python code for this?

Thank you so much in advance :)

CodePudding user response:

from itertools import combinations

result = {
    cols: (
        df[list(cols)].notna().all(axis=1)               # the selected variables are not nan
        & df.drop(columns=list(cols)).isna().all(axis=1) # and the rest is nan
    ).sum()
    # change to combinations(df.columns, 1) for single variable
    # change to combinations(df.columns, 3) for triple variable
    # etc.
    for cols in combinations(df.columns, 2)
}

Result is a dictionary that looks like this:

{
    ('varA', 'varB'): 10,
    ('varA', 'varC'): 20,
    ...
}

... which means there are 10 rows where varA and varB are filled but the remaining variables are all NaN; 20 for the (varA, varC) combination.
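Note that with 48 columns, looping over every subset size is infeasible (2^48 combinations). A sketch of an alternative that only counts fill patterns which actually occur in the data: group rows by their `notna()` pattern with `DataFrame.value_counts`. The small frame below is hypothetical, standing in for the 48-column one in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame for illustration
df = pd.DataFrame({
    "varA": [1.0, 2.0, np.nan],
    "varB": [np.nan, 5.0, np.nan],
    "varC": [np.nan, np.nan, np.nan],
})

# Each row's True/False fill pattern, counted as a percentage of all rows.
# The index of the resulting Series is the (varA, varB, varC) pattern.
pattern_pct = df.notna().value_counts(normalize=True) * 100
print(pattern_pct)
```

This visits each row once instead of enumerating subsets, and the result directly answers the "in X% of total rows, these columns are filled and the rest are NaN" question for every pattern that occurs.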

CodePudding user response:

pd.notna will give you a series of boolean values indicating whether the argument is not-NaN, so you have True for not-NaN and False for NaN.

You can do this for all the columns you care about, and then use np.logical_and.reduce with axis=1 to find which rows satisfy this condition for all columns you care about.

pd.isna will give you the inverse. To enforce the "all other columns are NaN" condition, you can apply this similar treatment to the columns you are not interested in.

Finally, .sum() on a boolean array will give the number of elements that are True, which is what we want.

import numpy as np
import pandas as pd

def count_not_na_rows(df: pd.DataFrame, columns_with_values: list) -> int:
    nan_columns = list(set(df.columns) - set(columns_with_values))
    not_nan_rows = np.logical_and.reduce(pd.notna(df.loc[:, columns_with_values]), axis=1)
    nan_rows = np.logical_and.reduce(pd.isna(df.loc[:, nan_columns]), axis=1)
    result = np.logical_and(not_nan_rows, nan_rows)
    return result.sum()

To test this:

df = pd.DataFrame([[2, 1, np.nan],
                   [1, np.nan, np.nan],
                   [np.nan, np.nan, np.nan]],
                  columns=["varA", "varB", "varC"])

count_not_na_rows(df, ["varA"])  # 1  -> only one row has varA but NaN for everything else
count_not_na_rows(df, ["varB"])  # 0  -> No rows have varB but NaN for everything else
count_not_na_rows(df, ["varC"])  # 0

count_not_na_rows(df, ["varA", "varB"]) # 1 -> only one row has varA and varB but NaN for everything else
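Since the question also asks for percentages, a count like the ones above can be turned into one by dividing by `len(df)`; equivalently, `.mean()` on the boolean mask gives the fraction of `True` rows directly. A minimal sketch, reusing the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 1, np.nan],
                   [1, np.nan, np.nan],
                   [np.nan, np.nan, np.nan]],
                  columns=["varA", "varB", "varC"])

# Rows where varA is the only filled column
mask = pd.notna(df["varA"]) & pd.isna(df[["varB", "varC"]]).all(axis=1)

# The mean of a boolean Series is the fraction of True rows
pct = mask.mean() * 100
```

Here one of the three rows matches, so `pct` works out to roughly 33.3%.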