Multiple if conditions on pandas dataframe with different thresholds

Time:12-12

I have a dataframe with several parameters:

par1      par2      par3      par4      par5       
1.122208  1.054132  1.133250  1.114845  1.183850
1.076445  1.128663  0.998518  1.081816  1.006934
1.077058  1.561871  1.045255  1.120456  1.768667
0.904869  1.183985  0.938095  0.927841  1.201934
0.876596  1.044014  0.877457  0.871429  0.990452
...

Each parameter has its own threshold: par1 is checked against threshold1, par2 against threshold2, and so on up to par5 and threshold5, and the thresholds all differ from one another. For each row, I need to check whether at least two of the parameters are above their respective thresholds. It does not matter which parameters exceed their thresholds, as long as at least two of them do.

So far I have written an ugly nested if statement, but I was wondering what the best approach would be here.

CodePudding user response:

Does this help solve your problem?

import pandas as pd

df = pd.DataFrame(
  {
    'par1': [1.122208, 1.076445, 1.077058, 0.904869, 0.876596],
    'par2': [1.054132, 1.128663, 1.561871, 1.183985, 1.044014],
    'par3': [1.133250, 0.998518, 1.045255, 0.938095, 0.877457],
    'par4': [1.114845, 1.081816, 1.120456, 0.927841, 0.871429],
    'par5': [1.183850, 1.006934, 1.768667, 1.201934, 0.990452],
  }
)

thresholds = {
  'par1': 0.5,
  'par2': 3,
  'par3': 1.2,
  'par4': 1.1,
  'par5': 3,
}

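def check_thresholds(input_row):
  # Count how many values in the row exceed their column's threshold
  no_over_threshold = sum(
    value > thresholds[col_name] for col_name, value in input_row.items()
  )

  # True when at least two parameters are above their thresholds
  return no_over_threshold >= 2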

df['above_thresholds'] = df.apply(check_thresholds, axis=1)
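One caveat with the row-wise version: check_thresholds looks up every column of the row in thresholds, so any column without a threshold raises a KeyError. That includes above_thresholds itself if the apply line is run a second time. A minimal sketch of one way around this is to select the threshold columns before calling apply. The label column, the shortened data, and the name count_needed are hypothetical and only there for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "label": ["a", "b", "c"],  # hypothetical extra column with no threshold
        "par1": [1.122208, 1.076445, 0.876596],
        "par4": [1.114845, 1.081816, 0.871429],
    }
)
thresholds = {"par1": 0.5, "par4": 1.1}
count_needed = 2  # hypothetical name for the "at least two" rule


def check_thresholds(input_row):
    # Only the selected columns reach this function, so every lookup succeeds.
    n_over = sum(value > thresholds[col] for col, value in input_row.items())
    return n_over >= count_needed


# Select the threshold columns first; all other columns are ignored.
cols = list(thresholds)
df["above_thresholds"] = df[cols].apply(check_thresholds, axis=1)

# Running the same line again is safe, because the result column is not in cols.
df["above_thresholds"] = df[cols].apply(check_thresholds, axis=1)
print(df["above_thresholds"].tolist())
```

With this selection in place, the DataFrame can carry identifiers or other metadata alongside the parameters without affecting the check.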

CodePudding user response:

Using Kelvin Ducray's sample data, we can take the solution a step further: instead of a row-wise apply, use pandas' vectorized operations, which should be faster:

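thresholds = pd.Series(thresholds)

# compare each column of df with its threshold (aligned on column name),
# sum across the booleans in each row,
# then check whether the count is >= 2
above_thresholds = df.gt(thresholds).sum(axis=1).ge(2)

# assign returns a new DataFrame; df itself is not modified
df.assign(above_thresholds=above_thresholds)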

       par1      par2      par3      par4      par5  above_thresholds
0  1.122208  1.054132  1.133250  1.114845  1.183850              True
1  1.076445  1.128663  0.998518  1.081816  1.006934             False
2  1.077058  1.561871  1.045255  1.120456  1.768667              True
3  0.904869  1.183985  0.938095  0.927841  1.201934             False
4  0.876596  1.044014  0.877457  0.871429  0.990452             False
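To see why the chained expression gives this result, it may help to unpack it into its intermediate steps. The following is a sketch using the same sample data and thresholds as above. The names over and n_over are illustrative, and selecting df[thresholds.index] is an added precaution (not part of the answer) so that columns without a threshold cannot affect the count:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "par1": [1.122208, 1.076445, 1.077058, 0.904869, 0.876596],
        "par2": [1.054132, 1.128663, 1.561871, 1.183985, 1.044014],
        "par3": [1.133250, 0.998518, 1.045255, 0.938095, 0.877457],
        "par4": [1.114845, 1.081816, 1.120456, 0.927841, 0.871429],
        "par5": [1.183850, 1.006934, 1.768667, 1.201934, 0.990452],
    }
)
thresholds = pd.Series(
    {"par1": 0.5, "par2": 3, "par3": 1.2, "par4": 1.1, "par5": 3}
)

# Step 1: boolean frame, each column compared with the threshold of the same name.
over = df[thresholds.index].gt(thresholds, axis="columns")

# Step 2: per-row count of parameters above their threshold (True counts as 1).
n_over = over.sum(axis=1)

# Step 3: apply the "at least two" rule.
above_thresholds = n_over.ge(2)

print(n_over.tolist())            # [2, 1, 2, 1, 1]
print(above_thresholds.tolist())  # [True, False, True, False, False]
```

Only par1 and par4 ever exceed their thresholds in this sample, which is why rows 0 and 2 reach a count of two while the others stop at one.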