I want to compute, for each row, a list of the IDs of the columns with missing values.
Current dataframe:
import numpy as np
import pandas as pd

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15'],
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2]}
df = pd.DataFrame(data)
Expected output:
data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15', '2021-08-15'],
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2],
        'MISSING_ID': [[1, 3], [1, 3], np.nan, [3], [0, 1, 3]]}
expected = pd.DataFrame(data)
Here 'MISSING_ID' is based on the missing values in the columns DIVISION, STORE_ID, FORECAST_DATE, START_TIME and FULFILLMENT_TYPE.
I can achieve this by writing multiple np.where conditions and then collecting them into a list:
multiplier_df['DIV_MISSING'] = np.where(multiplier_df.DIVISION.isna(),1,0)
multiplier_df['STORE_MISSING'] = np.where(multiplier_df.STORE_ID.isna(),2,0)
multiplier_df['DATE_MISSING'] = np.where(multiplier_df.FORECAST_DATE.isna(),3,0)
multiplier_df['HOUR_MISSING'] = np.where(multiplier_df.START_TIME.isna(),4,0)
multiplier_df['FULFILLMENT_MISSING'] = np.where(multiplier_df.FULFILLMENT_TYPE.isna(),5,0)
flag_cols = ['DIV_MISSING', 'STORE_MISSING', 'DATE_MISSING',
             'HOUR_MISSING', 'FULFILLMENT_MISSING']
multiplier_df['MISSING_ID'] = multiplier_df[flag_cols].apply(list, axis=1)
However, this results in unwanted 0's, so I'm looking for a simpler way to do it. I see a similar solution in R here: "Return list of column names with missing (NA) data for each row of a data frame in R", but I need to do it in pandas.
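For comparison, the linked R answer returns the names of the missing columns rather than numeric positions. A minimal pandas sketch of that name-based variant, assuming the same five check columns as in the question:

```python
import numpy as np
import pandas as pd

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15'] * 5,
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG']}
df = pd.DataFrame(data)

check_cols = ['DIVISION', 'STORE_ID', 'FORECAST_DATE',
              'START_TIME', 'FULFILLMENT_TYPE']
# For each row, keep the names of the columns whose value is NaN;
# row.index[row] selects the index labels where the boolean row is True
df['MISSING_COLS'] = df[check_cols].isna().apply(
    lambda row: list(row.index[row]), axis=1)
print(df['MISSING_COLS'].tolist())
```

This gives lists of column names (e.g. ['STORE_ID', 'START_TIME'] for the first row) instead of positional IDs, which may or may not be what you want downstream.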
CodePudding user response:
# 1) Take part of dataframe, only those columns in which we search for nans
df_part = df[['DIVISION', 'STORE_ID', 'FORECAST_DATE', 'START_TIME', 'FULFILLMENT_TYPE']]
# 2) use pd.isna(df).values to get np.array of True/False indicating where nans are
# then use list comprehension and np.where to determine for each row which columns have nans
df['MISSING_ID'] = [np.where(row)[0] for row in pd.isna(df_part).values]
print(df)
DIVISION STORE_ID FORECAST_DATE ... MULTIPLIER OVERWRITE MISSING_ID
0 DENVER NaN 2021-08-15 ... 1.04 NaN [1, 3]
1 JEWEL NaN 2021-08-15 ... 1.10 0.0 [1, 3]
2 JEWEL 0001 2021-08-15 ... 1.04 NaN []
3 DENVER 0001 2021-08-15 ... 1.10 1.0 [3]
4 NaN NaN 2021-08-15 ... 0.90 2.0 [0, 1, 3]
For now it gives [] instead of np.nan, but that is easy to fix:
df['MISSING_ID'] = df['MISSING_ID'].apply(lambda x: x if len(x) else np.nan)
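Putting the two steps together, here is a self-contained sketch that reproduces the expected MISSING_ID column (each np.where result is wrapped in a plain list so it compares cleanly with the expected values):

```python
import numpy as np
import pandas as pd

data = {'DIVISION': ['DENVER', 'JEWEL', 'JEWEL', 'DENVER', np.nan],
        'STORE_ID': [np.nan, np.nan, '0001', '0001', np.nan],
        'FORECAST_DATE': ['2021-08-15'] * 5,
        'START_TIME': [np.nan, np.nan, 9.0, np.nan, np.nan],
        'FULFILLMENT_TYPE': ['DUG', 'DELIVERY', 'DUG', 'DUG', 'DUG'],
        'MULTIPLIER': [1.04, 1.10, 1.04, 1.10, 0.90],
        'OVERWRITE': [np.nan, 0, np.nan, 1, 2]}
df = pd.DataFrame(data)

# 1) Restrict the NaN search to the five ID-relevant columns
check_cols = ['DIVISION', 'STORE_ID', 'FORECAST_DATE',
              'START_TIME', 'FULFILLMENT_TYPE']
# 2) For each row, np.where(row)[0] returns the positions of the True (NaN) flags
df['MISSING_ID'] = [list(np.where(row)[0])
                    for row in pd.isna(df[check_cols]).values]
# 3) Replace empty lists with np.nan, as in the expected output
df['MISSING_ID'] = df['MISSING_ID'].apply(lambda x: x if len(x) else np.nan)
print(df['MISSING_ID'].tolist())
```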