I want to compute, for each row, a list of the IDs of the columns with missing values.
Current dataframe:
import numpy as np
import pandas as pd

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15'],
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2]}
df = pd.DataFrame(data)
Expected output:
data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15'],
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2],
        'MISSING_ID': [[1, 3], [1, 3], np.nan, [3], [0, 1, 3]]}
expected = pd.DataFrame(data)
Here 'MISSING_ID' is based on the missing values in the columns DIVISION, STORE_ID, FORECAST_DATE, START_TIME and FULFILLMENT_TYPE.
I can achieve this by writing multiple np.where conditions and then collecting them into a list:
multiplier_df['DIV_MISSING'] = np.where(multiplier_df.DIVISION.isna(),1,0)
multiplier_df['STORE_MISSING'] = np.where(multiplier_df.STORE_ID.isna(),2,0)
multiplier_df['DATE_MISSING'] = np.where(multiplier_df.FORECAST_DATE.isna(),3,0)
multiplier_df['HOUR_MISSING'] = np.where(multiplier_df.START_TIME.isna(),4,0)
multiplier_df['FULFILLMENT_MISSING'] = np.where(multiplier_df.FULFILLMENT_TYPE.isna(),5,0)
flag_cols = ['DIV_MISSING', 'STORE_MISSING', 'DATE_MISSING',
             'HOUR_MISSING', 'FULFILLMENT_MISSING']
multiplier_df['MISSING_ID'] = multiplier_df[flag_cols].apply(list, axis=1)
However, this results in unwanted 0's, so I'm looking for a simpler way to do it. I see a similar solution in R here: "Return list of column names with missing (NA) data for each row of a data frame in R", but I need to do it in pandas.
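For comparison, the linked R answer returns the names of the missing columns rather than numeric positions. A minimal pandas sketch of that name-based variant, assuming the same five check columns as in the question:

```python
import numpy as np
import pandas as pd

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15'] * 5,
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG']}
df = pd.DataFrame(data)

check_cols = ['DIVISION', 'STORE_ID', 'FORECAST_DATE',
              'START_TIME', 'FULFILLMENT_TYPE']
# For each row, keep the names of the columns whose value is NaN;
# row.index[row] selects the index labels where the boolean row is True
df['MISSING_COLS'] = df[check_cols].isna().apply(
    lambda row: list(row.index[row]), axis=1)
print(df['MISSING_COLS'].tolist())
```

This gives lists of column names (e.g. ['STORE_ID', 'START_TIME'] for the first row) instead of positional IDs, which may or may not be what you want downstream.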
CodePudding user response:
# 1) Take part of dataframe, only those columns in which we search for nans
df_part = df[['DIVISION', 'STORE_ID', 'FORECAST_DATE', 'START_TIME', 'FULFILLMENT_TYPE']]
# 2) use pd.isna(df).values to get np.array of True/False indicating where nans are
# then use list comprehension and np.where to determine for each row which columns have nans
df['MISSING_ID'] = [np.where(row)[0] for row in pd.isna(df_part).values]
print(df)
DIVISION STORE_ID FORECAST_DATE ... MULTIPLIER OVERWRITE MISSING_ID
0 DENVER NaN 2021-08-15 ... 1.04 NaN [1, 3]
1 JEWEL NaN 2021-08-15 ... 1.10 0.0 [1, 3]
2 JEWEL 0001 2021-08-15 ... 1.04 NaN []
3 DENVER 0001 2021-08-15 ... 1.10 1.0 [3]
4 NaN NaN 2021-08-15 ... 0.90 2.0 [0, 1, 3]
For now it gives [] instead of np.nan, but that is easy to fix:
df['MISSING_ID'] = df['MISSING_ID'].apply(lambda x: x if len(x) else np.nan)
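Putting the two steps together, here is a self-contained sketch that reproduces the expected MISSING_ID column (each np.where result is wrapped in a plain list so it compares cleanly with the expected values):

```python
import numpy as np
import pandas as pd

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15'] * 5,
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2]}
df = pd.DataFrame(data)

# 1) Restrict the NaN search to the five ID-relevant columns
check_cols = ['DIVISION', 'STORE_ID', 'FORECAST_DATE',
              'START_TIME', 'FULFILLMENT_TYPE']
# 2) For each row, np.where(row)[0] returns the positions of the True (NaN) flags
df['MISSING_ID'] = [list(np.where(row)[0])
                    for row in pd.isna(df[check_cols]).values]
# 3) Replace empty lists with np.nan, as in the expected output
df['MISSING_ID'] = df['MISSING_ID'].apply(lambda x: x if len(x) else np.nan)
print(df['MISSING_ID'].tolist())
```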