I have a DataFrame with around 1000 columns, some columns have 0 NaNs, some have 3, some have 400.
What I want to do is remove all columns where there exists a number of consecutive NaNs that are larger than some threshold N, the rest I will impute by taking mean of nearest neighbors.
df
ColA | ColB | ColC | ColD | ColE
NaN 5 3 NaN NaN
NaN 6 NaN 4 4
NaN 7 4 4 4
NaN 5 5 NaN NaN
NaN 5 4 NaN 4
NaN 3 3 NaN 3
threshold = 2
remove_consecutive_nan(df,threshold)
Which would return
ColB | ColC | ColE
5 3 NaN
6 NaN 4
7 4 4
5 5 NaN
5 4 4
3 3 3
How would I write the remove_consecutive_nan
function?
CodePudding user response:
You can create groups for each column of same values for consecutive missing values, then count each column separately and last filter out columns by threshold
:
def remove_consecutive_nan(df,threshold):
m = df.notna()
mask = m.cumsum().mask(m).apply(pd.Series.value_counts).gt(threshold)
return df.loc[:, ~mask.any(axis=0)]
print (remove_consecutive_nan(df, 2))
ColB ColC ColE
0 5 3.0 NaN
1 6 NaN 4.0
2 7 4.0 4.0
3 5 5.0 NaN
4 5 4.0 4.0
5 3 3.0 3.0
Alternative with counts missing values by range:
def remove_consecutive_nan(df,threshold):
m = df.isna()
b = m.cumsum()
mask = b.sub(b.mask(m).ffill().fillna(0)).gt(threshold)
return df.loc[:, ~mask.any(axis=0)]