Remove columns with more than N consecutive NaN's-CodePudding

I have a DataFrame with around 1000 columns, some columns have 0 NaNs, some have 3, some have 400.

What I want to do is remove all columns where there exists a number of consecutive NaNs that are larger than some threshold N, the rest I will impute by taking mean of nearest neighbors.

df

ColA | ColB | ColC  | ColD | ColE
NaN     5      3      NaN    NaN
NaN     6     NaN      4      4 
NaN     7      4       4      4 
NaN     5      5      NaN    NaN
NaN     5      4      NaN     4
NaN     3      3      NaN     3

threshold = 2

remove_consecutive_nan(df,threshold)

Which would return

ColB | ColC | ColE
 5      3     NaN
 6     NaN     4
 7      4      4
 5      5     NaN
 5      4      4
 3      3      3

How would I write the remove_consecutive_nan function?

CodePudding user response：

You can create groups for each column of same values for consecutive missing values, then count each column separately and last filter out columns by threshold:

def remove_consecutive_nan(df,threshold):
    m = df.notna()
    mask = m.cumsum().mask(m).apply(pd.Series.value_counts).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]

print (remove_consecutive_nan(df, 2))

   ColB  ColC  ColE
0     5   3.0   NaN
1     6   NaN   4.0
2     7   4.0   4.0
3     5   5.0   NaN
4     5   4.0   4.0
5     3   3.0   3.0

Alternative with counts missing values by range:

def remove_consecutive_nan(df,threshold):
    m = df.isna()
    b = m.cumsum()
    mask = b.sub(b.mask(m).ffill().fillna(0)).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]