Suppose I have the following pandas column, in which the non-NaN values always come in pairs of equal values, one after the other:
l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan, 4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1]
How can I fill the NaN values only between the two values of each pair?
Expected output:
[np.nan, np.nan, 10, 10, 10, 10, 10, np.nan, 4, 4, 4, 5, 5, 5, np.nan, 2, 2, 2, 1, 1]
I could only find this answer, which is not the same problem:
CodePudding user response:
Solution with ffill and bfill:
f = df['col'].ffill()
b = df['col'].bfill()
df['col'].mask(f == b, f)
0 NaN
1 NaN
2 10.0
3 10.0
4 10.0
5 10.0
6 10.0
7 NaN
8 4.0
9 4.0
10 4.0
11 5.0
12 5.0
13 5.0
14 NaN
15 2.0
16 2.0
17 2.0
18 1.0
19 1.0
Name: col, dtype: float64
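For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same ffill/bfill idea. The column name 'col' comes from the answer above; the DataFrame construction and the second series (`same`) are assumptions added for illustration. A row is filled only where the forward fill and the backward fill agree, which means the nearest valid values on both sides are equal.

```python
import numpy as np
import pandas as pd

l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan,
     4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1]
df = pd.DataFrame({'col': l})

f = df['col'].ffill()  # last valid value at or before each row
b = df['col'].bfill()  # next valid value at or after each row

# Where both fills agree, the row lies between two equal values
# (or is one of them), so it is safe to take the filled value.
filled = df['col'].mask(f == b, f)

expected = pd.Series(
    [np.nan, np.nan, 10, 10, 10, 10, 10, np.nan, 4, 4, 4, 5, 5, 5,
     np.nan, 2, 2, 2, 1, 1], name='col', dtype='float64')
print(filled.equals(expected))  # True

# Caveat (illustrative data, not from the question): if two neighbouring
# pairs hold the same value, the gap between the pairs is filled as well.
same = pd.Series([4, np.nan, 4, np.nan, 4, np.nan, 4])
print(same.mask(same.ffill() == same.bfill(), same.ffill()).tolist())
# [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
```

So this approach relies on adjacent pairs having different values, which holds for the data in the question.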
CodePudding user response:
You can use
m = df['l'].eq(df['l']).cumsum()
out = df.groupby(m).apply(lambda g: g.fillna({'l': g['l'].max()}) if g.name % 2 else g)
Detailed explanation:
Assume you have the dataframe
l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan, 4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1, np.nan]
df = pd.DataFrame({'l': l})
print(df)
l
0 NaN
1 NaN
2 10.0
3 NaN
4 NaN
5 NaN
6 10.0
7 NaN
8 4.0
9 NaN
10 4.0
11 5.0
12 NaN
13 5.0
14 NaN
15 2.0
16 NaN
17 2.0
18 1.0
19 1.0
20 NaN
You can use the fact that NaN is not equal to NaN to build a cumulative sum.
df['m'] = df['l'].eq(df['l']).cumsum()
print(df)
l m
0 NaN 0
1 NaN 0
2 10.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 10.0 2
7 NaN 2
8 4.0 3
9 NaN 3
10 4.0 4
11 5.0 5
12 NaN 5
13 5.0 6
14 NaN 6
15 2.0 7
16 NaN 7
17 2.0 8
18 1.0 9
19 1.0 10
20 NaN 10
Notice that the True values of df['l'].eq(df['l']) only occur in pairs. The first row can be either True or False, but either way:
- The cumsum at the starting True of the first True pair can only be 1
- The cumsum at the ending True of the first True pair can only be 2
The same holds for every other True pair: the cumsum at the starting True is an odd number, and the cumsum at the ending True is an even number.
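The parity claim can be checked directly. The sketch below is only a verification aid, not part of the answer's solution; the names `s`, `valid`, `starts` and `ends` are made up for illustration, and `s.notna()` is used as an equivalent of `s.eq(s)` for this float column.

```python
import numpy as np
import pandas as pd

l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan,
     4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1, np.nan]
s = pd.Series(l)

m = s.notna().cumsum()   # running count of non-NaN rows
valid = m[s.notna()]     # the count at each non-NaN row

starts = valid.iloc[0::2]  # first value of every pair
ends = valid.iloc[1::2]    # second value of every pair

print(starts.tolist())  # [1, 3, 5, 7, 9]
print(ends.tolist())    # [2, 4, 6, 8, 10]
```

Every NaN row inherits the count of the valid row before it, so a NaN inside a pair carries an odd count and a NaN between pairs carries an even one.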
With this in mind, we can group by the cumsum and fill only the groups whose cumsum is odd, using the non-NaN value of each group.
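# group only the 'l' column by the helper column; group_keys=False keeps the original index on newer pandas
out = df[['l']].groupby(df['m'], group_keys=False).apply(lambda g: g.fillna({'l': g['l'].max()}) if g.name % 2 else g)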
print(out)
l
0 NaN
1 NaN
2 10.0
3 10.0
4 10.0
5 10.0
6 10.0
7 NaN
8 4.0
9 4.0
10 4.0
11 5.0
12 5.0
13 5.0
14 NaN
15 2.0
16 2.0
17 2.0
18 1.0
19 1.0
20 NaN
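If the groupby/apply step turns out to be slow on a large frame, the same parity idea can probably be expressed without groups. This is an alternative sketch, not the answer's code: it assumes the pairing described in the question and replaces the per-group fill with a forward fill restricted to rows whose running count is odd.

```python
import numpy as np
import pandas as pd

l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan,
     4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1, np.nan]
df = pd.DataFrame({'l': l})

m = df['l'].notna().cumsum()  # same counter as df['l'].eq(df['l']).cumsum()

# Even count: outside a pair (or the closing value), keep the row as it is.
# Odd count: inside a pair, take the forward-filled opening value.
out = df['l'].where(m % 2 == 0, df['l'].ffill())

expected = pd.Series(
    [np.nan, np.nan, 10, 10, 10, 10, 10, np.nan, 4, 4, 4, 5, 5, 5,
     np.nan, 2, 2, 2, 1, 1, np.nan], name='l', dtype='float64')
print(out.equals(expected))  # True
```

Unlike the ffill/bfill comparison, this variant does not depend on neighbouring pairs having different values, but it does depend on the valid values being strictly paired.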