Find whether a column value reached the upper limit or the lower limit first in pandas

I have a pandas DataFrame, shown below:

id  val   ulim  llim  
 1  100.25 101   98
 2   97.30 99    95
 3  104.22 106   100
 4  105.00 107   102
 5  95.00  99    91
 .. ....   ..    ..
 100000  105.92 107   103

For each row, I need to find whether the values in the following rows reach that row's upper limit (ulim) first or its lower limit (llim) first.

For example:

For the first row, the value (val) is 100.25, the upper limit is 101 and the lower limit is 98. The value in the second row, 97.30, is below the first row's lower limit (llim). Hence, I will mark this row as -1.

For the second row, the value (val) is 97.30, the upper limit is 99 and the lower limit is 95. The value in the third row, 104.22, is above the upper limit. Hence, this row will be marked as 1.

For the third row, the value (val) is 104.22, the upper limit is 106 and the lower limit is 100. The value in the fourth row (105) lies between the upper and lower limits, so we move on to the fifth row, where the value is 95, which is below the lower limit (100). Hence, this row will be marked as -1.

The target df would be as follows:

id  val   ulim  llim result
 1  100.25 101   98   -1
 2   97.30 99    95    1
 3  104.22 106   100  -1
 4  105.00 107   102  -1 
 5  95.00  99    91    1 
 .. ....   ..    ..
 100000  105.92 107  103 NaN

I have more than a million rows like this. Is it possible to have a solution without iteration?

My current iterative solution is very slow. It works as follows:

1. Loop through each row in the data frame.
2. Find the row index and slice the data frame as df.iloc[current_index:].
3. Take the val column from the sliced data frame and convert it to a list (sliced_df["val"].tolist()).
4. Use a list comprehension to check whether the upper limit or the lower limit was reached first.
5. Mark the result column based on the result of step 4.
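
The iterative approach described above can be sketched as a plain double loop, which is also useful as a reference for checking faster versions on a small slice of the data. This is a minimal sketch, not the asker's actual code: the function name `first_limit_hit_loop` is hypothetical, it assumes that only rows after the current one count, and it treats the limits as strict (a value must be above ulim or below llim).

```python
import numpy as np
import pandas as pd


def first_limit_hit_loop(df):
    """Return 1, -1 or NaN per row by scanning the later rows one by one."""
    val = df["val"].to_numpy()
    ulim = df["ulim"].to_numpy()
    llim = df["llim"].to_numpy()
    n = len(df)
    result = np.full(n, np.nan)  # NaN means neither limit was ever reached
    for i in range(n):
        for j in range(i + 1, n):
            if val[j] > ulim[i]:  # upper limit reached first
                result[i] = 1
                break
            if val[j] < llim[i]:  # lower limit reached first
                result[i] = -1
                break
    return pd.Series(result, index=df.index, name="result")


# Sample rows from the question (the elided rows are left out).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 100000],
    "val": [100.25, 97.30, 104.22, 105.00, 95.00, 105.92],
    "ulim": [101, 99, 106, 107, 99, 107],
    "llim": [98, 95, 100, 102, 91, 103],
})
sample["result"] = first_limit_hit_loop(sample)
print(sample)
```

The inner loop stops at the first value outside the limits, so the cost depends on how far ahead each row has to look. In the worst case it is quadratic in the number of rows.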

CodePudding user response：

You need three tools for this task: shift to move the values up one row, np.select to map the comparisons to 1 and -1, and bfill to back fill the values where needed:

import numpy as np

# first we shift the values up by 1 row
shifted = df['val'].shift(-1)

# temporarily compare the shifted value to the limits;
# NaN marks the rows where the next value is within the limits
df['result'] = np.select([shifted > df['ulim'], shifted < df['llim']], [1, -1], np.nan)

Here, we almost have what we want:

       id     val  ulim  llim  result
0       1  100.25   101    98    -1.0
1       2   97.30    99    95     1.0
2       3  104.22   106   100     NaN
3       4  105.00   107   102    -1.0
4       5   95.00    99    91     1.0
5  100000  105.92   107   103     NaN

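except for the NaN at id == 3 (the NaN in the last row is expected, since no rows follow it). To handle it, we mask the shifted values wherever result is NaN, back fill them, then compare again: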

shifted = shifted.mask(df['result'].isna()).bfill()

The shifted series now looks like this (notice that the 95 has been back filled into row 2 as well):

0     97.30
1    104.22
2     95.00
3     95.00
4    105.92
5       NaN
Name: val, dtype: float64
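
To see what mask and bfill each contribute, here is a small self-contained sketch. The `result` values are copied from the first-pass table above, and the variable names `masked` and `filled` are only illustrative.

```python
import numpy as np
import pandas as pd

val = pd.Series([100.25, 97.30, 104.22, 105.00, 95.00, 105.92], name="val")
# Outcome of the first comparison pass, copied from the table above.
result = pd.Series([-1.0, 1.0, np.nan, -1.0, 1.0, np.nan])

shifted = val.shift(-1)                # next row's value; last row becomes NaN
masked = shifted.mask(result.isna())   # blank out values that decided nothing
filled = masked.bfill()                # pull the next remaining value upwards

print(pd.DataFrame({"shifted": shifted, "masked": masked, "filled": filled}))
```

Row 2 loses its own shifted value (105.00, which was within its limits) and receives 95.00 from row 3. The last row stays NaN because nothing follows it.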

Now we can repeat the assignment, using fillna so that only the rows still missing a result are filled:

df['result'] = df['result'].fillna(np.select((shifted > df['ulim'], shifted < df['llim']), (1,-1), np.nan))

Output:

       id     val  ulim  llim  result
0       1  100.25   101    98    -1.0
1       2   97.30    99    95     1.0
2       3  104.22   106   100    -1.0
3       4  105.00   107   102    -1.0
4       5   95.00    99    91     1.0
5  100000  105.92   107   103     NaN

Note: this works with the sample data. In general, you may need to repeat the process until the result column (or the shifted series) stops changing.
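
One caveat with the back fill: it can hand a row a value from several rows ahead, so the values in between are never compared against that row's own limits. The sketch below uses a different technique from the answer above. It compares each unresolved row against the value 1 row ahead, then 2 rows ahead, and so on, and drops rows from the pending set as soon as they are decided. The function name `first_limit_hit` is hypothetical, and the strict comparisons and the assumption that val contains no NaN are mine, not the answerer's.

```python
import numpy as np
import pandas as pd


def first_limit_hit(df):
    """1 if a later val exceeds ulim first, -1 if one drops below llim first, else NaN."""
    val = df["val"].to_numpy(dtype=float)
    ulim = df["ulim"].to_numpy(dtype=float)
    llim = df["llim"].to_numpy(dtype=float)
    n = len(df)
    result = np.full(n, np.nan)
    pending = np.arange(n - 1)  # the last row has no later rows to look at
    k = 1                       # current look-ahead offset
    while pending.size:
        pending = pending[pending + k < n]  # drop rows with nothing k rows ahead
        ahead = val[pending + k]
        up = ahead > ulim[pending]
        down = ahead < llim[pending]
        result[pending[up]] = 1
        result[pending[down]] = -1
        pending = pending[~(up | down)]     # keep only the undecided rows
        k += 1
    return pd.Series(result, index=df.index, name="result")


# Sample rows from the question (the elided rows are left out).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 100000],
    "val": [100.25, 97.30, 104.22, 105.00, 95.00, 105.92],
    "ulim": [101, 99, 106, 107, 99, 107],
    "llim": [98, 95, 100, 102, 91, 103],
})
df["result"] = first_limit_hit(df)
print(df)
```

Each pass is vectorized, and the number of passes equals the longest look-ahead any row needs. This should be fast when limits are usually reached within a few rows, but it degrades toward quadratic time if many rows stay within their limits for a long stretch.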

CodePudding user response：

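You simply need to create a new column with shift(-1), then compare it to the upper and lower limits. Note that this only checks the immediately following row, so rows whose next value is within the limits are left as NaN.

import numpy as np

df1['valnew'] = df1['val'].shift(-1)
# 1 if the next value reaches the upper limit, -1 if it reaches the lower limit, NaN otherwise
df1['check'] = np.select([df1['valnew'] >= df1['ulim'], df1['valnew'] <= df1['llim']], [1, -1], np.nan)
df1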