One hot vector in pandas to encode missing values

I am working with a large pandas dataframe and a few columns have lots of missing data. I am not totally confident with my imputation and I believe the presence or absence of data for these variables could be useful information, so I would like to add another column of the dataframe with 0 where the entry is missing and 1 otherwise. Is there a quick/efficient way to do this in pandas?

CodePudding user response：

Try out the following:

df['New_Col'] = df['Col'].notna().astype('uint8')

Where Col is your column containing np.nan values and New_Col is your binary indicator column: 1 where Col has a value and 0 where it is missing.
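
To check the behaviour on a small case, here is a minimal sketch. The toy dataframe is made up for illustration; Col and New_Col are the placeholder names from the line above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'Col' stands in for the column with missing values.
# None in a numeric column is stored as NaN, so both count as missing.
df = pd.DataFrame({"Col": [1.5, np.nan, 3.0, None]})

# notna() is True where a value is present; casting to uint8 turns that into 1/0.
df["New_Col"] = df["Col"].notna().astype("uint8")

print(df)
```

Using uint8 keeps the indicator column to one byte per row, which can matter on a large dataframe.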

CodePudding user response：

The relevant method here is .notna(), which returns a boolean Series: True where a value is present and False where it is missing. To apply it to multiple columns of interest, use:

for c in cols_of_interest:
    df[f'{c}_not_missing'] = 1 * df[c].notna()

Note that multiplying a boolean Series by 1 gives integer 0/1 values; .astype(int) does the same thing more explicitly.
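
If there are many columns, the same flags can probably be built in one vectorized step instead of a loop. This is only a sketch under assumptions: the dataframe, its column names, and the contents of cols_of_interest are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; only 'age' and 'income' get indicator columns.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "income": [np.nan, np.nan, 52000.0],
    "city": ["Oslo", None, "Lima"],
})
cols_of_interest = ["age", "income"]

# notna() on a DataFrame works column by column; add_suffix renames the
# resulting flag columns so they do not clash with the originals.
indicators = (
    df[cols_of_interest]
    .notna()
    .astype("uint8")
    .add_suffix("_not_missing")
)

# Attach the flag columns next to the original data.
df = pd.concat([df, indicators], axis=1)

print(df)
```

Building all flags at once and concatenating them avoids inserting columns one by one, which pandas can warn about when a dataframe gets many new columns in a loop.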