I have a pandas DataFrame, shown below:
id val ulim llim
1 100.25 101 98
2 97.30 99 95
3 104.22 106 100
4 105.00 107 102
5 95.00 99 91
.. .... .. ..
100000 105.92 107 103
For each row, I need to find whether the upper limit (ulim) or the lower limit (llim) is reached first by the values in the rows that follow.
For example:
For the first row, the value (val) is 100.25, the upper limit is 101 and the lower limit is 98. The value in the second row, 97.30, is less than the lower limit (llim). Hence, I will mark this row as -1.
For the second row, the value (val) is 97.30, the upper limit is 99 and the lower limit is 95. The value in the third row, 104.22, is higher than the upper limit. Hence, this row will be marked as 1.
For the third row, the value (val) is 104.22, the upper limit is 106 and the lower limit is 100. The value in the fourth row (105) is between the upper and lower limits, so we move on to the fifth row, where the value is 95, which is below the lower limit (100). Hence, this row will be marked as -1.
The target df would be as follows:
id val ulim llim result
1 100.25 101 98 -1
2 97.30 99 95 1
3 104.22 106 100 -1
4 105.00 107 102 -1
5 95.00 99 91 1
.. .... .. ..
100000 105.92 107 103 NaN
I have more than a million rows like this. Is it possible to have a solution without iteration?
The iterative solution I tried is very slow. It works as follows:
- Loop through each row in the data frame.
- Find the row index and slice the data frame as df.iloc[current_index:].
- Take the val column from the sliced data frame and convert it to a list (sliced_df["val"].tolist()).
- Use a list comprehension to check whether the upper limit or the lower limit was reached first.
- Mark the result column based on the outcome of the previous step.
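The steps above can be sketched as a plain loop. This is only a reference version, not the asker's actual code: the function name `first_hit_loop` is made up for illustration, it assumes strict comparisons (as in the worked examples), and it starts scanning at the row after the current one. It is quadratic in the worst case, but it is handy for checking a faster approach on a small sample.

```python
import numpy as np
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 100000],
    "val": [100.25, 97.30, 104.22, 105.00, 95.00, 105.92],
    "ulim": [101, 99, 106, 107, 99, 107],
    "llim": [98, 95, 100, 102, 91, 103],
})

def first_hit_loop(frame):
    """Hypothetical helper: 1 if ulim is exceeded first, -1 if llim, NaN if neither."""
    vals = frame["val"].tolist()
    out = []
    for i, (upper, lower) in enumerate(zip(frame["ulim"], frame["llim"])):
        hit = np.nan  # stays NaN when no later value leaves the band
        for j in range(i + 1, len(vals)):
            if vals[j] > upper:
                hit = 1
                break
            if vals[j] < lower:
                hit = -1
                break
        out.append(hit)
    return pd.Series(out, index=frame.index, name="result")

df["result"] = first_hit_loop(df)
```

On the sample rows this gives -1, 1, -1, -1, 1 and NaN for the last row, matching the target table.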
CodePudding user response:
You need several things to complete this task: shift to shift the values up, np.select to map the values, then bfill to back-fill the values when needed:
import numpy as np
import pandas as pd
# first, shift the values up by one row
shifted = df['val'].shift(-1)
# first pass: compare the shifted value to the limits;
# `nan` marks the rows where the next value is within the limits
df['result'] = np.select((shifted > df['ulim'], shifted < df['llim']), (1,-1), np.nan)
Here, we almost have what we want:
id val ulim llim result
0 1 100.25 101 98 -1.0
1 2 97.30 99 95 1.0
2 3 104.22 106 100 NaN
3 4 105.00 107 102 -1.0
4 5 95.00 99 91 1.0
5 100000 105.92 107 103 NaN
except for the NaN value at id==3 (the NaN in the last row is expected, since nothing follows it). We can mask the shifted values on those rows, back-fill them, then compare again:
shifted = shifted.mask(df['result'].isna()).bfill()
Now the shifted series is as follows (notice that the 95 has been filled into row 2 as well):
0 97.30
1 104.22
2 95.00
3 95.00
4 105.92
5 NaN
Name: val, dtype: float64
And we can repeat the assignment; fillna fills in only the rows where result is still missing:
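# fillna needs a scalar, dict or Series, so wrap the array in a Series aligned on df's index
df['result'] = df['result'].fillna(pd.Series(np.select((shifted > df['ulim'], shifted < df['llim']), (1,-1), np.nan), index=df.index))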
Output:
id val ulim llim result
0 1 100.25 101 98 -1.0
1 2 97.30 99 95 1.0
2 3 104.22 106 100 -1.0
3 4 105.00 107 102 -1.0
4 5 95.00 99 91 1.0
5 100000 105.92 107 103 NaN
Note: this works with the sample data. However, you may want to repeat the process until the result column (or the shifted series) no longer changes.
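One way to read "repeat the process" is to look one row further ahead on each pass and only re-check the rows that are still unresolved. The sketch below is a swapped-in variant rather than the mask-and-bfill code above: a back-filled value is the next one that broke its own row's limits, so values in between are never compared against the current row's limits, whereas this version compares every later value in order. The name `first_hit_lookahead` and the `max_lookahead` parameter are made up for illustration, and the code assumes ulim >= llim on every row. It loops over look-ahead distance rather than over rows, so it is fast when most rows resolve within a few steps, but it is still quadratic in the worst case.

```python
import numpy as np
import pandas as pd

def first_hit_lookahead(frame, max_lookahead=None):
    """Hypothetical helper: 1 if ulim is exceeded first, -1 if llim, NaN if neither."""
    val = frame["val"].to_numpy(dtype=float)
    ulim = frame["ulim"].to_numpy(dtype=float)
    llim = frame["llim"].to_numpy(dtype=float)
    n = len(frame)
    result = np.full(n, np.nan)
    pending = np.arange(n)  # positions of rows not yet resolved
    last = n - 1 if max_lookahead is None else min(max_lookahead, n - 1)
    for k in range(1, last + 1):
        # Rows with no row k steps ahead can never be resolved; they stay NaN.
        pending = pending[pending + k < n]
        if pending.size == 0:
            break
        ahead = val[pending + k]
        up = ahead > ulim[pending]
        down = ahead < llim[pending]
        result[pending[up]] = 1
        result[pending[down]] = -1
        pending = pending[~(up | down)]
    return pd.Series(result, index=frame.index, name="result")

# Usage on the sample rows from the question.
sample = pd.DataFrame({
    "val": [100.25, 97.30, 104.22, 105.00, 95.00, 105.92],
    "ulim": [101, 99, 106, 107, 99, 107],
    "llim": [98, 95, 100, 102, 91, 103],
})
sample["result"] = first_hit_lookahead(sample)
```

On the sample rows the first pass resolves every row except id==3, and the second pass resolves that one too. Setting `max_lookahead` caps the number of passes at the cost of leaving slow-to-resolve rows as NaN.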
CodePudding user response:
You simply need to create a new column with shift(-1), then compare it to the upper and lower limits (note that this only looks one row ahead):
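df1['valnew'] = df1['val'].shift(-1)
# 1 if the next value reaches ulim, -1 if it reaches llim, NaN otherwise
df1['check'] = np.select([df1['valnew'] >= df1['ulim'], df1['valnew'] <= df1['llim']], [1, -1], np.nan)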
df1