I have a dataframe that look like this:
data = pd.DataFrame([NaN, NaN, NaN, 0.5, 1.2, 2.7, 3.8, NaN, NaN, 0.1, 0.7, 2.3, NaN, NaN, NaN, NaN, NaN, 0.01, 0.4, 1.5, 2.8, 4.5, 5.6, NaN, NaN, NaN, NaN, NaN, NaN])
The extra column I need to create should look like this:
data = pd.DataFrame([NaN, NaN, NaN, 1, 1, 1, 1, NaN, NaN, 2, 2, 2, NaN, NaN, NaN, NaN, NaN, 3, 3, 3, 3, 3, 3, NaN, NaN, NaN, NaN, NaN, NaN])
Basically, when there are NaNs, the integer remains the same, and when not a NaN is found it should increment by 1 unit and be constant until another NaN is found.
Is there an elegant way to do this without going through every row individually? My dataframe is very large.
CodePudding user response:
Let's consider this input data
NaN = np.nan
data = pd.DataFrame(
[NaN, NaN, NaN, 0.5, 1.2, 2.7, 3.8, NaN, NaN, 0.1, 0.7, 2.3, NaN, NaN,
NaN, NaN, NaN, 0.01, 0.4, 1.5, 2.8, 4.5, 5.6, NaN, NaN, NaN, NaN, NaN, NaN])
So the idea is to first create a Boolean column where it is not nan with notna
.
m = data[0].notna()
Now, you can get the diff
on this Boolean column to get True
when it change from True to False or vice-versa. Keep the True
only when it changes when notna
with &m
. Use cumsum
to get the incremental integer and then where
with the Boolean mask m
to replace by NaN where it was Nan originally. So in one liner:
data['res'] = (m.diff()&m).cumsum().where(m)
print(data)
0 res
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 0.50 1.0
4 1.20 1.0
5 2.70 1.0
6 3.80 1.0
7 NaN NaN
8 NaN NaN
9 0.10 2.0
10 0.70 2.0
11 2.30 2.0
12 NaN NaN
13 NaN NaN
...