Home > Enterprise >  How to find jumps in values in dataframe using pandas
How to find jumps in values in dataframe using pandas

Time:08-22

Input :

B      0   15000.000000  698.617126  -0.745298

B      1   17000.000000  647.001708  -0.269421

B      2   18000.000000  649.572265  -0.097540

B      4   19000.000000  665.282775   0.008724

B      5   20000.000000  690.836975  -0.011101

I have a data frame with 5 columns, in the 1st column, there is a jump in the 3rd row, instead of the value 3 it is 4 and in column 2 there is a jump in the 1st row, instead of 16000 it is 17000. How to mark these values. Also, how find this jump with a specified value in each column?

Expected output:

considering 1st column. display the index number when the difference from 1 row to the next row is not 1. example: in the 1st column above, the difference of the value of 1st and 2nd row is 2 instead of 1.

output:

0     nan     nan       nan        nan

1     1      2000      -51.61      0.47628

2     2      1000       2.57       0.171881

CodePudding user response:

In my opinion, the code is heavily dependent on the definition of what jumps are. Let's say it's an absolute difference between vertical neighbors that is greater than 75% of others. In addition, why not mark the areas with stagnation - those which are lower then 25% of others?

In this case the code could be like below:

import pandas as pd
import numpy as np
from io import StringIO

data = '''
0   15000.000000  698.617126  -0.745298
1   17000.000000  647.001708  -0.269421
2   18000.000000  649.572265  -0.097540
4   19000.000000  665.282775   0.008724
5   20000.000000  690.836975  -0.011101
'''

df = pd.read_csv(
    StringIO(data), 
    sep='\s ', 
    header=None, 
    names=['one','two','three','four']
)

# find an absolute difference between vertical neighbours
# and fill the first line with zero as if no difference detected
diff = df.diff().abs().fillna(df.diff().median())      # concide DataFrame.pct_change as an option

# define a significant difference as one
# that is less then 25th or greater than the 75th percentile 
low, high = np.percentile(diff, [25, 75], axis=0)
jump = diff > high
stagnation = diff < low


# color significant difference in green and red
# for the 25 and 75 percentiles, respectively
# notice that .apply iterates over the columns
# and jump/stagnation masks have the same captions as df
(
    df.style
    .hide_index()
    .format({'one':'{:g}','two':'{:g}','three':'{:.2f}','four':'{:.2f}'})
    .apply(lambda x: 
        np.where(
           jump[x.name], 
           'color: red',
            None
       )
    ).apply(lambda x: 
        np.where(
           stagnation[x.name], 
           'color: green', 
           None
       )
    )
)

Output:

image

CodePudding user response:

Given df:

  col_0  col_1    col_2       col_3     col_4
0     B      0  15000.0  698.617126 -0.745298
1     B      1  17000.0  647.001708 -0.269421
2     B      2  18000.0  649.572265 -0.097540
3     B      4  19000.0  665.282775  0.008724
4     B      5  20000.0  690.836975 -0.011101

We can mark where the difference is greater than 1 or 1000 like so:

df['jump_1'] = df['col_1'].diff().gt(1)
df['jump_2'] = df['col_2'].diff().gt(1000)
# If you want to mark the row above, change to look like:
             # df['col_1'].diff(-1).abs().gt(1)

# Output:
  col_0  col_1    col_2       col_3     col_4  jump_1  jump_2
0     B      0  15000.0  698.617126 -0.745298   False   False
1     B      1  17000.0  647.001708 -0.269421   False    True
2     B      2  18000.0  649.572265 -0.097540   False   False
3     B      4  19000.0  665.282775  0.008724    True   False
4     B      5  20000.0  690.836975 -0.011101   False   False

If we just want the row where True:

>>> df[df['jump_1']]
  col_0  col_1    col_2       col_3     col_4  jump_1  jump_2
3     B      4  19000.0  665.282775  0.008724    True   False

If we just want the indices in a list:

>>> df[df['jump_1']].index.to_list()
[3]
  • Related