Pandas forward fill in between - same values

Time:05-16

I am trying to do the following:

Suppose I have the following pandas column, in which each value always appears twice in sequence.

l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan, 4, 5, np.nan, 5, np.nan,  2, np.nan, 2, 1, 1]

How can I fill the NaN values only in the interval between two equal values?

expected output:

[np.nan, np.nan, 10, 10, 10, 10, 10, np.nan, 4, 4, 4, 5, 5, 5, np.nan,  2, 2, 2, 1, 1]

I could only find this answer, which is not the same problem:

Stackoverflow Question

CodePudding user response:

Solution with ffill and bfill: a NaN is filled only where the forward-filled and backward-filled values agree.

f = df['col'].ffill()
b = df['col'].bfill()

df['col'].mask(f == b, f)

0      NaN
1      NaN
2     10.0
3     10.0
4     10.0
5     10.0
6     10.0
7      NaN
8      4.0
9      4.0
10     4.0
11     5.0
12     5.0
13     5.0
14     NaN
15     2.0
16     2.0
17     2.0
18     1.0
19     1.0
Name: col, dtype: float64
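For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the question's list is stored in a column named 'col' (the question does not name the column) and compares the result with the expected output from the question; the names result, expected and matches are only illustrative.

```python
import numpy as np
import pandas as pd

# Data and expected output copied from the question.
l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan,
     4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1]
expected = [np.nan, np.nan, 10, 10, 10, 10, 10, np.nan, 4, 4,
            4, 5, 5, 5, np.nan, 2, 2, 2, 1, 1]

# Assumption: the column is called 'col', as in the answer above.
df = pd.DataFrame({'col': l})

f = df['col'].ffill()  # last valid value seen so far
b = df['col'].bfill()  # next valid value coming up

# Where both neighbours agree, take that value; otherwise keep the original.
result = df['col'].mask(f == b, f)

# Series.equals treats NaNs in the same position as equal.
matches = result.equals(pd.Series(expected, name='col'))
print(matches)
```

Comparisons involving NaN are False, so the leading NaNs (and any gap whose two neighbours differ) are left untouched.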

CodePudding user response:

You can use

m = df['l'].eq(df['l']).cumsum()
# group_keys=False keeps the original index in recent pandas versions
out = df.groupby(m, group_keys=False).apply(lambda g: g.fillna({'l': g['l'].max()}) if g.name % 2 else g)

Detailed explanation:

Assume you have the dataframe

l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan, 4, 5, np.nan, 5, np.nan,  2, np.nan, 2, 1, 1, np.nan]

df = pd.DataFrame({'l': l})
print(df)

       l
0    NaN
1    NaN
2   10.0
3    NaN
4    NaN
5    NaN
6   10.0
7    NaN
8    4.0
9    NaN
10   4.0
11   5.0
12   NaN
13   5.0
14   NaN
15   2.0
16   NaN
17   2.0
18   1.0
19   1.0
20   NaN

You can exploit the fact that NaN never compares equal to NaN: df['l'].eq(df['l']) is True only on non-NaN rows, so its cumsum increases by one at each actual value.

df['m'] = df['l'].eq(df['l']).cumsum()
print(df)

       l   m
0    NaN   0
1    NaN   0
2   10.0   1
3    NaN   1
4    NaN   1
5    NaN   1
6   10.0   2
7    NaN   2
8    4.0   3
9    NaN   3
10   4.0   4
11   5.0   5
12   NaN   5
13   5.0   6
14   NaN   6
15   2.0   7
16   NaN   7
17   2.0   8
18   1.0   9
19   1.0  10
20   NaN  10
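As a small illustration of the trick used for the m column, the sketch below checks on a toy Series (made up for this example, not taken from the question) that comparing a column with itself gives the same mask as notna(), and shows the resulting counter.

```python
import numpy as np
import pandas as pd

# Toy data, chosen only for illustration.
s = pd.Series([np.nan, 10, np.nan, 10, np.nan])

# NaN == NaN is False, so only rows holding a real value become True.
is_value = s.eq(s)

# The same mask can be obtained more explicitly with notna().
same_as_notna = is_value.equals(s.notna())

# The counter goes up by one at every non-NaN row and stays flat on NaN rows.
counter = is_value.cumsum()
print(counter.tolist())
```

If readability matters more than brevity, s.notna().cumsum() expresses the same idea without relying on NaN comparison semantics.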

Notice that the True values (the non-NaN rows) only occur in pairs. The first row is either True or False, but whichever it is:

  • The cumsum at the start True of the first pair is always 1
  • The cumsum at the end True of the first pair is always 2

The same holds for every other pair: the cumsum at the start True is an odd number, and the cumsum at the end True is an even number.

With this in mind, we can group by the cumsum and fill only the odd-numbered groups with the non-NaN value of the group.

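# group on the 'm' column computed above, but keep only 'l' in the result
out = df[['l']].groupby(df['m'], group_keys=False).apply(lambda g: g.fillna({'l': g['l'].max()}) if g.name % 2 else g)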
print(out)

       l
0    NaN
1    NaN
2   10.0
3   10.0
4   10.0
5   10.0
6   10.0
7    NaN
8    4.0
9    4.0
10   4.0
11   5.0
12   5.0
13   5.0
14   NaN
15   2.0
16   2.0
17   2.0
18   1.0
19   1.0
20   NaN
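The same odd/even idea can also be written without groupby.apply. The following is only a sketch of an alternative, not the code from the answer above: it forward-fills everything and keeps the filled value only where the counter is odd. The variable names (s, m, out, pairs) are illustrative.

```python
import numpy as np
import pandas as pd

# Data from the detailed explanation (note the trailing NaN).
l = [np.nan, np.nan, 10, np.nan, np.nan, np.nan, 10, np.nan, 4, np.nan,
     4, 5, np.nan, 5, np.nan, 2, np.nan, 2, 1, 1, np.nan]
s = pd.Series(l, name='l')

# Counter of non-NaN values seen so far: odd inside a pair, even outside.
m = s.notna().cumsum()

# Keep the original where the counter is even, use the forward fill where it is odd.
out = s.where(m % 2 == 0, s.ffill())
print(out.tolist())

# Toy case where neighbouring pairs share the same value.
pairs = pd.Series([1, np.nan, 1, np.nan, 1, np.nan, 1])
by_parity = pairs.where(pairs.notna().cumsum() % 2 == 0, pairs.ffill())
by_ffill_bfill = pairs.mask(pairs.ffill() == pairs.bfill(), pairs.ffill())
print(by_parity.tolist())
print(by_ffill_bfill.tolist())
```

On the toy pairs Series the two approaches differ: the parity version leaves the gap between the two pairs as NaN, while the ffill/bfill comparison fills it because both neighbours happen to be equal. For the data in the question, where adjacent pairs always differ, both give the same result.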