Home > Mobile >  pandas: Select rows by diff with previous columns, but only one time per row
pandas: Select rows by diff with previous columns, but only one time per row

Time:10-02

I have a dataset like below

ID value
1 10
2 15
3 18
4 30
5 35

I would like to keep all the rows that has value - value of the previous row <=5, so I do

df['diff'] = df.value.diff()
df = df[df.diff <= 5]

Then I will have

ID value diff
2 15 5
3 18 3
5 35 5

However, I don't want to keep row 3, because row 2 is kept due to row 1, and as row 1 and row 2 become a pair, row 3 should not be paired with row 2 anymore.

How could I do that using pandas? Indeed I can write a for loop but it is not the best idea.

CodePudding user response:

So you have the mask that checks if difference to previous row <= 5:

>>> d = df.value.diff().le(5)
>>> d
1    False
2     True
3     True
4    False
5     True

Rows marked with True will be kept, but you don't want to keep a True row if the previous row was also True.

Then we can shift this mask, negate it and & with the original to convert True's that have True in previous row into False:

>>> d & ~d.shift(fill_value=False)
1    False
2     True
3    False
4    False
5     True

where fill_value is needed otherwise there arises NaN and it "can't bitwise-negate float". Putting False there has no effect other than silencing that issue.

Now we can select the rows from the dataframe with this resultant mask:

>>> wanted = d & ~d.shift(fill_value=False)
>>> df[wanted]

ID  value
2   15
5   35
  • Related