Pandas: Get rows with consecutive column values

Time:08-12

I need to go through a large pandas DataFrame and select consecutive rows that share the same value in a column. For example, in the DataFrame below, using column x, I want to keep only the runs of consecutive rows where x is 3 or 5:

col row x   y
1   1   1   1
5   7   3   0
2   2   2   2
6   3   3   8
9   2   3   4
5   3   3   9
4   9   4   4
5   5   5   1
3   7   5   2
6   6   6   6
5   8   6   2
3   7   6   0

The desired output would be:

col row x   y
6   3   3   8
9   2   3   4
5   3   3   9
5   5   5   1
3   7   5   2
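
For anyone who wants to reproduce this, here is a minimal sketch of the sample as a DataFrame. The default 0-based index and the name `expected` are assumptions, not part of the original post:

```python
import pandas as pd

# Sample data copied from the table above; the default RangeIndex (0..11) is assumed.
df = pd.DataFrame({
    "col": [1, 5, 2, 6, 9, 5, 4, 5, 3, 6, 5, 3],
    "row": [1, 7, 2, 3, 2, 3, 9, 5, 7, 6, 8, 7],
    "x":   [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6],
    "y":   [1, 0, 2, 8, 4, 9, 4, 1, 2, 6, 2, 0],
})

# The rows the question wants to keep, selected by index label for reference.
expected = df.loc[[3, 4, 5, 7, 8]]
print(expected)
```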

I tried:

m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]

But that also includes the consecutive 6s, which I don't want.

I also tried:

df.query('x in [3, 5]')

That returns every row where x is 3 or 5, including the isolated 3 that is not part of a run.
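
To make the gap concrete, here is a rough sketch of what the two attempts select on the sample (only column x is reproduced, with an assumed default index). Each attempt enforces one half of the requirement, so intersecting the two conditions is one possible way to get the desired rows; the names `in_run`, `attempt1`, `attempt2`, and `combined` are made up for illustration:

```python
import pandas as pd

# Only column x from the sample is reproduced here.
df = pd.DataFrame({"x": [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6]})

# Attempt 1: rows equal to the previous row or to the next row (any value).
m = df["x"].eq(df["x"].shift())
in_run = m | m.shift(-1, fill_value=False)
attempt1 = df[in_run]

# Attempt 2: membership only, with no check on neighbouring rows.
attempt2 = df.query("x in [3, 5]")

print(attempt1.index.tolist())  # [3, 4, 5, 7, 8, 9, 10, 11]
print(attempt2.index.tolist())  # [1, 3, 4, 5, 7, 8]

# Requiring both conditions at once keeps only runs of 3 or 5.
combined = df[in_run & df["x"].isin([3, 5])]
print(combined.index.tolist())  # [3, 4, 5, 7, 8]
```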

CodePudding user response:

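IIUC, use masks for boolean indexing. Check for 3 or 5, then use a cummax and a reversed cummax to keep only the rows between the first 3 and the last 5. Note that this does not check run length, so an isolated 3 or 5 inside that range (such as the row 5 7 3 0 in the sample) is also kept: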

m1 = df['x'].eq(3)
m2 = df['x'].eq(5)

out = df[(m1|m2)&(m1.cummax()&m2[::-1].cummax())]

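Output (the index labels suggest this was produced from a version of the sample without the isolated 3 row; on the sample as shown in the question, the row at index 1 would also be included):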

   col  row  x  y
2    6    3  3  8
3    9    2  3  4
4    5    3  3  9
6    5    5  5  1
7    3    7  5  2
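
As a sanity check, here is a sketch of the same masks run on column x of the sample as posted, and again with the isolated 3 dropped. The helper name `between_first_3_and_last_5` is hypothetical, and the reversed mask is flipped back into row order before combining, which is a small variation on the one-liner above rather than the exact expression:

```python
import pandas as pd

# Only column x from the question's sample is reproduced here.
df = pd.DataFrame({"x": [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6]})

def between_first_3_and_last_5(frame):
    m1 = frame["x"].eq(3)
    m2 = frame["x"].eq(5)
    # True from the first 3 onwards.
    after_first_3 = m1.cummax()
    # Reverse, cummax, then reverse back: True up to and including the last 5.
    up_to_last_5 = m2[::-1].cummax()[::-1]
    return frame[(m1 | m2) & after_first_3 & up_to_last_5]

kept = between_first_3_and_last_5(df).index.tolist()
print(kept)  # [1, 3, 4, 5, 7, 8] -> the isolated 3 at index 1 is kept

# Same masks on the sample with the isolated 3 removed and the index reset.
df_no_isolated = df.drop(index=1).reset_index(drop=True)
kept_no_isolated = between_first_3_and_last_5(df_no_isolated).index.tolist()
print(kept_no_isolated)  # [2, 3, 4, 6, 7]
```

So this approach is fine when every 3 and 5 in the range belongs to a run, but it needs an extra run-length check otherwise.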

CodePudding user response:

You can create a group id for each run of consecutive values, then filter by the run length and by the value of x:

# create unique ids for consecutive groups, then get group length:
group_num = (df.x.shift() != df.x).cumsum()
group_len = group_num.groupby(group_num).transform("count")

# filter main df:
df[(df.x.isin([3,5])) & (group_len > 1)]

Output:

   col  row  x  y
3    6    3  3  8
4    9    2  3  4
5    5    3  3  9
7    5    5  5  1
8    3    7  5  2
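
If this is needed in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name `rows_in_runs` and the `min_len` parameter are made up for illustration, and it uses "size" instead of "count" for the run length so that missing values are counted as rows too:

```python
import pandas as pd

def rows_in_runs(df, column, values, min_len=2):
    """Keep rows whose `column` value is in `values` and belongs to a run
    of at least `min_len` equal consecutive values."""
    s = df[column]
    # A new run starts wherever the value differs from the previous row.
    run_id = s.ne(s.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = s.groupby(run_id).transform("size")
    return df[s.isin(values) & run_len.ge(min_len)]

# Only column x from the question's sample is reproduced here.
df = pd.DataFrame({"x": [1, 3, 2, 3, 3, 3, 4, 5, 5, 6, 6, 6]})

result = rows_in_runs(df, "x", [3, 5])
print(result.index.tolist())  # [3, 4, 5, 7, 8]

# Requiring runs of at least three rows drops the pair of 5s.
longer = rows_in_runs(df, "x", [3, 5], min_len=3)
print(longer.index.tolist())  # [3, 4, 5]
```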