Find the first and the last row index of nan values in a pandas dataframe if the number of consecuti...

I have a pandas DataFrame with 80 rows and 6 columns, and some columns contain NaN values. If the number of consecutive NaN values in a given column exceeds 10, I need to do something about that chunk of NaNs. How can I extract the row indices at which each of these big chunks of NaNs starts and ends?

Answer:

setup

Let's assume the following example input:

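import numpy as np
import pandas as pd

np.random.seed(3)
# 80x6 frame of random 0/1 values, then turn every 0 into NaN
df = pd.DataFrame(np.random.choice([0, 1], p=[0.7, 0.3], size=(80, 6)))
df = df.mask(df.eq(0))
df.head(15)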

input:

      0    1    2    3    4    5
0   NaN  1.0  NaN  NaN  1.0  1.0
1   NaN  NaN  NaN  NaN  NaN  NaN
2   NaN  NaN  NaN  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  NaN  NaN
4   NaN  1.0  NaN  NaN  NaN  1.0
5   1.0  NaN  1.0  1.0  NaN  NaN
6   NaN  NaN  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  1.0  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN
9   NaN  NaN  1.0  1.0  NaN  NaN
10  NaN  NaN  NaN  NaN  NaN  NaN
11  NaN  NaN  1.0  NaN  NaN  NaN
12  1.0  NaN  NaN  NaN  NaN  NaN
13  NaN  NaN  NaN  NaN  NaN  1.0
14  NaN  NaN  1.0  NaN  1.0  1.0
...
79  1.0  NaN  NaN  NaN  NaN  NaN

computing the consecutive NaN length

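First compute a boolean mask with isna. Comparing the mask with its shifted self, mask.ne(mask.shift()), flags every row where the NaN / non-NaN state changes, and cumsum turns those flags into a run id per column. where(mask) then keeps only the ids of the NaN runs. Finally, a groupby transformation with 'size' gives, for every NaN cell, the number of consecutive NaNs in the run it belongs to: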

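mask = df.isna()

# run id per column: increases each time the NaN / non-NaN state changes
df_na_sizes = (mask.ne(mask.shift()).cumsum()
                   .where(mask)  # keep the ids of the NaN runs only
                   .apply(lambda c: c.groupby(c).transform('size'))  # run length per cell
              )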

output:

      0     1    2    3     4    5
0   5.0   NaN  5.0  5.0   NaN  NaN
1   5.0   3.0  5.0  5.0  13.0  3.0
2   5.0   3.0  5.0  5.0  13.0  3.0
3   5.0   3.0  5.0  5.0  13.0  3.0
4   5.0   NaN  5.0  5.0  13.0  NaN
5   NaN  12.0  NaN  NaN  13.0  8.0
6   6.0  12.0  3.0  1.0  13.0  8.0
7   6.0  12.0  3.0  NaN  13.0  8.0
8   6.0  12.0  3.0  1.0  13.0  8.0
9   6.0  12.0  NaN  NaN  13.0  8.0
10  6.0  12.0  1.0  5.0  13.0  8.0
11  6.0  12.0  NaN  5.0  13.0  8.0
12  NaN  12.0  2.0  5.0  13.0  8.0
13  2.0  12.0  2.0  5.0  13.0  NaN
14  2.0  12.0  NaN  5.0   NaN  NaN
    ...
79  NaN   1.0  2.0  5.0   4.0  1.0
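
To make the intermediate steps easier to follow, here is a small sketch that applies the same chain to a single short Series and keeps every intermediate result. The Series `s` and the variable names are made up for illustration; only the pandas calls come from the code above.

```python
import numpy as np
import pandas as pd

# Hypothetical 8-row column: NaN runs of length 2, 1 and 3
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 1.0, np.nan, np.nan, np.nan])

is_na = s.isna()
# True on the first row of every run (NaN or not)
changed = is_na.ne(is_na.shift())
# Running count of changes gives one id per run
run_id = changed.cumsum()
# Blank out the ids of the non-NaN runs
na_run_id = run_id.where(is_na)
# Size of each remaining group, broadcast back to the rows
sizes = na_run_id.groupby(na_run_id).transform("size")

steps = pd.DataFrame({
    "value": s,
    "is_na": is_na,
    "run_id": run_id,
    "na_run_id": na_run_id,
    "size": sizes,
})
print(steps)
```

Here run_id comes out as 1, 1, 2, 3, 4, 5, 5, 5, and the rows holding a real value end up with NaN in both na_run_id and size.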

using the run lengths for filtering

You can now select cells by the length of the run they belong to. For example, let's replace the NaNs with 'XX' wherever a run has 10 or more consecutive NaNs (use gt(10) instead of ge(10) if the run must strictly exceed 10):

>>> df.mask(df_na_sizes.ge(10), 'XX')
      0    1    2    3    4    5
0   NaN  1.0  NaN  NaN  1.0  1.0
1   NaN  NaN  NaN  NaN   XX  NaN
2   NaN  NaN  NaN  NaN   XX  NaN
3   NaN  NaN  NaN  NaN   XX  NaN
4   NaN  1.0  NaN  NaN   XX  1.0
5   1.0   XX  1.0  1.0   XX  NaN
6   NaN   XX  NaN  NaN   XX  NaN
7   NaN   XX  NaN  1.0   XX  NaN
8   NaN   XX  NaN  NaN   XX  NaN
9   NaN   XX  1.0  1.0   XX  NaN
10  NaN   XX  NaN  NaN   XX  NaN
11  NaN   XX  1.0  NaN   XX  NaN
12  1.0   XX  NaN  NaN   XX  NaN
13  NaN   XX  NaN  NaN   XX  1.0
14  NaN   XX  1.0  NaN  1.0  1.0
...
79  1.0  NaN  NaN  NaN  NaN  NaN
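
The question asks for the row indices where each long run of NaNs starts and ends, which the run lengths do not give directly. Below is a minimal sketch of one way to get them, reusing the same run-id idea column by column and aggregating the index labels of each run. The helper name `nan_runs`, its `min_len` parameter, and the small `demo` frame are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def nan_runs(df, min_len=10):
    """List every run of at least `min_len` consecutive NaNs, per column."""
    records = []
    for col in df.columns:
        is_na = df[col].isna()
        if not is_na.any():
            continue  # nothing to report for this column
        # The run id increases whenever the NaN / non-NaN state changes;
        # indexing with is_na keeps the rows of the NaN runs only.
        run_id = is_na.ne(is_na.shift()).cumsum()[is_na]
        # Index labels of the NaN rows, grouped by the run they belong to
        labels = run_id.index.to_series()
        runs = labels.groupby(run_id).agg(["first", "last", "size"])
        for start, end, length in runs.itertuples(index=False):
            if length >= min_len:
                records.append((col, start, end, int(length)))
    return pd.DataFrame(records, columns=["column", "start", "end", "length"])


# Hypothetical small frame, so the result is easy to check by eye
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, 2.0, np.nan],
    "b": [np.nan, np.nan, 1.0, 1.0, np.nan, np.nan],
})
result = nan_runs(demo, min_len=3)
print(result)
```

For the demo frame this yields a single row: column a, start 1, end 3, length 3. Calling nan_runs(df, min_len=10) on the example frame should, judging by the output shown earlier, report column 4 with start 1 and end 13. The threshold is inclusive here, matching ge(10); pass min_len=11 if a run must strictly exceed 10.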