Home > Software engineering >  Remove columns with more than N consecutive NaN's
Remove columns with more than N consecutive NaN's

Time:09-23

I have a DataFrame with around 1000 columns, some columns have 0 NaNs, some have 3, some have 400.

What I want to do is remove all columns where there exists a number of consecutive NaNs that are larger than some threshold N, the rest I will impute by taking mean of nearest neighbors.

df

ColA | ColB | ColC  | ColD | ColE
NaN     5      3      NaN    NaN
NaN     6     NaN      4      4 
NaN     7      4       4      4 
NaN     5      5      NaN    NaN
NaN     5      4      NaN     4
NaN     3      3      NaN     3

threshold = 2

remove_consecutive_nan(df,threshold)

Which would return

ColB | ColC | ColE
 5      3     NaN
 6     NaN     4
 7      4      4
 5      5     NaN
 5      4      4
 3      3      3

How would I write the remove_consecutive_nan function?

CodePudding user response:

You can create groups for each column of same values for consecutive missing values, then count each column separately and last filter out columns by threshold:

def remove_consecutive_nan(df,threshold):
    m = df.notna()
    mask = m.cumsum().mask(m).apply(pd.Series.value_counts).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]

print (remove_consecutive_nan(df, 2))

   ColB  ColC  ColE
0     5   3.0   NaN
1     6   NaN   4.0
2     7   4.0   4.0
3     5   5.0   NaN
4     5   4.0   4.0
5     3   3.0   3.0

Alternative with counts missing values by range:

def remove_consecutive_nan(df,threshold):
    m = df.isna()
    b = m.cumsum()
    mask = b.sub(b.mask(m).ffill().fillna(0)).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]
  • Related