I have the following dataframe:
   feature
0        1
1        0
2        0
3        0
4        0
5        1
6        0
7        1
I would like to create two columns that contain the number of rows between the current row and the previous and the next positive value. The output dataframe should look something like this:
   feature  previous_feat  next_feat
0        1             NA          5
1        0              1          4
2        0              2          3
3        0              3          2
4        0              4          1
5        1              5          2
6        0              1          1
7        1              2         NA
I have tried combinations of the shift and mask methods, but I can't get it to work. Note that the result can be either the number of rows or the index difference; it doesn't really matter to me. The same goes for the missing values: they can be NA or 0.
import pandas as pd
df = pd.DataFrame({"feature": [1, 0, 0, 0, 0, 1, 0, 1]})
# df["previous_feat"] = df.shift().mask(df["feature"] != 0)
CodePudding user response:
You can use groupby.cumcount and boolean masks:
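# form groups: g1 counts the 1s from the bottom up, g2 from the top down
g1 = df.loc[::-1, 'feature'].eq(1).cumsum()
g2 = df['feature'].eq(1).cumsum()
# m1 flags the first 1 (no earlier 1 to measure from), m2 the last 1 (no later 1)
m1 = g2.eq(1) & df['feature'].eq(1)
m2 = g1.eq(1) & df['feature'].eq(1)
# position inside each group, counted forward (previous) or backward (next)
df['previous_feat'] = df.groupby(g1).cumcount().add(1).mask(m1)
df['next_feat'] = df[::-1].groupby(g2).cumcount().add(1).mask(m2)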
Output:
   feature  previous_feat  next_feat
0        1            NaN        5.0
1        0            1.0        4.0
2        0            2.0        3.0
3        0            3.0        2.0
4        0            4.0        1.0
5        1            5.0        2.0
6        0            1.0        1.0
7        1            2.0        NaN
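To see why the grouping works, it can help to print the two helper series next to the data. The sketch below rebuilds the question's frame so that it runs on its own; is_one and steps are names made up for illustration and are not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({"feature": [1, 0, 0, 0, 0, 1, 0, 1]})

is_one = df["feature"].eq(1)
# Count the 1s from the bottom up (the series is reversed before the cumsum).
g1 = is_one[::-1].cumsum()
# Count the 1s from the top down.
g2 = is_one.cumsum()

# assign aligns on the index, so the reversed g1 lands back on the right rows.
steps = df.assign(g1=g1, g2=g2)
print(steps)
```

Here g1 stays constant from the row after a 1 through the next 1, so counting forward inside it gives previous_feat. g2 stays constant from a 1 up to the row before the next 1, so counting backward inside it gives next_feat.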
CodePudding user response:
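# shift by one row so that each 1 closes its group instead of opening the next one
groups = df["feature"].shift().cumsum()
# count up from the start of each group
df["previous_feat"] = df["feature"].groupby(groups).cumcount().add(1)
# count down to the end of each group, then move the result up one row
df["next_feat"] = df["feature"].groupby(groups).cumcount(ascending=False).add(1).shift(-1)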
Output:
   feature  previous_feat  next_feat
0        1            NaN        5.0
1        0            1.0        4.0
2        0            2.0        3.0
3        0            3.0        2.0
4        0            4.0        1.0
5        1            5.0        2.0
6        0            1.0        1.0
7        1            2.0        NaN
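A small sketch of the intermediate steps may make the shift-based grouping easier to follow. It rebuilds the question's frame so it runs on its own; countdown and steps are hypothetical names used only here. The first row has no group label after the shift, and this approach appears to rely on cumcount leaving such a row as NaN, as the output above suggests, so treat that part as an assumption about the pandas version in use.

```python
import pandas as pd

df = pd.DataFrame({"feature": [1, 0, 0, 0, 0, 1, 0, 1]})

# Shifting by one row makes each 1 the last member of its group.
groups = df["feature"].shift().cumsum()

# Rows left until the end of the group, before moving the values up one row.
countdown = df["feature"].groupby(groups).cumcount(ascending=False).add(1)

# shift(-1) moves each value up a row, which leaves the last row empty.
steps = df.assign(groups=groups, countdown=countdown, next_feat=countdown.shift(-1))
print(steps)
```

The final shift(-1) is what turns "rows left in my group" into "rows until the next 1", because each group ends on a 1 rather than starting on one.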