Home > Software engineering >  Identifing continous sequences of a condition
Identifing continous sequences of a condition

Time:09-01

I have a dataframe that looks like this

import pandas as pd

df = pd.DataFrame({
    'column': [0,0,1,1,1,2,2,2,2,1,1,]
})

    column
0        0
1        0
2        1
3        1
4        1
5        2
6        2
7        2
8        2
9        1
10       1

I am interested in identifying all continuous sequences(e.g. no jump in the index number) that matches some condition, the latter is trivial, e.g.

df[df['column'] == 1]

    column
2        1
3        1
4        1
9        1
10       1

The next step is to identify the indexes 2-4 and 9-10 as two discrete entities. I could(and have) make a regular python loop with some logic that does this to give something like(do not necessary need the indexes, just a way to extract the sequences themselves)

[(2,4), (9,10)]

But this seems unnecessary, and I am wondering if anyone have a more of a 'pandas' / efficient method?

CodePudding user response:

You can use:

# group by consecutive values
g = df['column'].ne(df['column'].shift()).cumsum()
# identify rows with "1"
m = df['column'].eq(1)

out = (df[m]
       .groupby(g, sort=False)
       .apply(lambda g: (g.index[0], g.index[-1]))
       .tolist()
      )

Output: [(2, 4), (9, 10)]

CodePudding user response:

Identify the target groups with:

grp = df['column'].diff().ne(0).cumsum()

Then group by those, pull first and last index for each group. Then select only group 1 and convert to list:

df.groupby(['column',grp]).apply(lambda x: (x.index[0],x.index[-1])).xs(1).to_list()

Result

[(2, 4), (9, 10)]
  • Related