I have a dataframe that looks like this
import pandas as pd
df = pd.DataFrame({
'column': [0,0,1,1,1,2,2,2,2,1,1,]
})
column
0 0
1 0
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 1
10 1
I am interested in identifying all continuous sequences(e.g. no jump in the index number) that matches some condition, the latter is trivial, e.g.
df[df['column'] == 1]
column
2 1
3 1
4 1
9 1
10 1
The next step is to identify the indexes 2-4 and 9-10 as two discrete entities. I could(and have) make a regular python loop with some logic that does this to give something like(do not necessary need the indexes, just a way to extract the sequences themselves)
[(2,4), (9,10)]
But this seems unnecessary, and I am wondering if anyone have a more of a 'pandas' / efficient method?
CodePudding user response:
You can use:
# group by consecutive values
g = df['column'].ne(df['column'].shift()).cumsum()
# identify rows with "1"
m = df['column'].eq(1)
out = (df[m]
.groupby(g, sort=False)
.apply(lambda g: (g.index[0], g.index[-1]))
.tolist()
)
Output: [(2, 4), (9, 10)]
CodePudding user response:
Identify the target groups with:
grp = df['column'].diff().ne(0).cumsum()
Then group by those, pull first and last index for each group. Then select only group 1 and convert to list:
df.groupby(['column',grp]).apply(lambda x: (x.index[0],x.index[-1])).xs(1).to_list()
Result
[(2, 4), (9, 10)]