How to create a series of integers in a dataframe only when there are no NaNs, and integers increasi-CodePudding

I have a dataframe that look like this:

data = pd.DataFrame([NaN, NaN, NaN, 0.5, 1.2, 2.7, 3.8, NaN, NaN, 0.1, 0.7, 2.3, NaN, NaN, NaN, NaN, NaN, 0.01, 0.4, 1.5, 2.8, 4.5, 5.6, NaN, NaN, NaN, NaN, NaN, NaN])

The extra column I need to create should look like this:

data = pd.DataFrame([NaN, NaN, NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, 2, NaN, NaN, NaN, NaN, NaN, 3, 3, 3, 3, 3, 3, NaN, NaN, NaN, NaN, NaN, NaN])

Basically, when there are NaNs, the integer remains the same, and when not a NaN is found it should increment by 1 unit and be constant until another NaN is found.

Is there an elegant way to do this without going through every row individually? My dataframe is very large.

CodePudding user response：

Let's consider this input data

NaN = np.nan
data = pd.DataFrame(
    [NaN, NaN, NaN, 0.5, 1.2, 2.7, 3.8, NaN, NaN, 0.1, 0.7, 2.3, NaN, NaN, 
     NaN, NaN, NaN, 0.01, 0.4, 1.5, 2.8, 4.5, 5.6, NaN, NaN, NaN, NaN, NaN, NaN])

So the idea is to first create a Boolean column where it is not nan with notna.

m = data[0].notna()

Now, you can get the diff on this Boolean column to get True when it change from True to False or vice-versa. Keep the True only when it changes when notna with &m. Use cumsum to get the incremental integer and then where with the Boolean mask m to replace by NaN where it was Nan originally. So in one liner:

data['res'] = (m.diff()&m).cumsum().where(m)
print(data)
       0  res
0    NaN  NaN
1    NaN  NaN
2    NaN  NaN
3   0.50  1.0
4   1.20  1.0
5   2.70  1.0
6   3.80  1.0
7    NaN  NaN
8    NaN  NaN
9   0.10  2.0
10  0.70  2.0
11  2.30  2.0
12   NaN  NaN
13   NaN  NaN
...