Home > Blockchain >  Python pandas and numpy: assign numerical values to new variable based on multiple conditions for ex
Python pandas and numpy: assign numerical values to new variable based on multiple conditions for ex

Time:08-24

This is trivial in Excel, why is this so hard in Python?

The goal is to compute the state variable based on several conditions, including the previous value of the state variable as well. Value is the known integer, Min(3) and Max(3) are simply the minimum and maximum value of a 3 period rolling window shifted forward by one period. This is how far I got.

Index Value Max(3) Min(3) 
0     10    nan    nan    
1     20    nan    nan        
2     15    nan    nan         
3     25    20     10    
4     15    25     15     
5     10    25     15     
6     15    20     10         

What is the best way to calculate the state variable based on the following conditions:

  • a) If Value > Max(3) then 1
  • b) If Value < Min(3) then 4
  • c) If Value <= Max(3) & Value >= Min (3) & previous State = 1 or 2 then 2
  • d) If Value <= Max(3) & Value >= Min (3) & previous State = 4 or 3 then 3

Should look like this in final DataFrame:

Index Value Max(3) Min(3) State
0     10    nan    nan    nan
1     20    nan    nan    nan    
2     15    nan    nan    nan     
3     25    20     10     1
4     15    25     15     2
5     10    25     15     4
6     15    20     10     3    

I have mostly tried this using np.where() funtion but always run into problems once I approach c) and d) conditions.

CodePudding user response:

You can use this:

df.loc[df.Value.gt(df['Max(3)']), 'State'] = 1
df.loc[df.Value.lt(df['Min(3)']), 'State'] = 4
df.loc[df.Value.between(df['Min(3)'], df['Max(3)']) & (df.State.shift(1).isin((3, 4))), 'State'] = 3
df.loc[df.Value.between(df['Min(3)'], df['Max(3)']) & (df.State.shift(1).isin((1,2))), 'State'] = 2

output:
enter image description here

Explaination:
the first statement checks where df.Value is greater than df['Max(3)'] and creates a new column 'State' filled with NaNs, and only 1s on the location of the condition

the seconds line sets 4s where df.Value is smaller than df.['Min(3)']

The last two statements check if df.Value is within Max(3) and Min(3) and compares the df.State last value (.shift). Note: you can also use .between here instead of .isin if the numbers are subsequential.

CodePudding user response:

np.select can handle multiple conditions quite well with good readability

df['Value'] = df['Value'].astype(float)

conditions =
[
    df['Value']>df['Max(3)'],
    df['Value']<df['Min(3)'],
    (df['Value']<=df['Max(3)']) & (df['Value']>= df['Min(3)']) & (df['State'].shift().isin((1,2))),
    (df['Value']<=df['Max(3)']) & (df['Value']>= df['Min(3)']) & (df['State'].shift().isin((3,4)))
]

choicelist = [1, 4, 2, 3]

df['State'] = np.select(conditions, choicelist, np.nan)
print(df)
   Index  Value  Max(3)  Min(3)  State
0      0   10.0     NaN     NaN    NaN
1      1   20.0     NaN     NaN    NaN
2      2   15.0     NaN     NaN    NaN
3      3   25.0    20.0    10.0    1.0
4      4   15.0    25.0    15.0    2.0
5      5   10.0    25.0    15.0    4.0
6      6   15.0    20.0    10.0    3.0
  • Related