Home > database >  How to filter dataframe based on varying thresholds for indexes
How to filter dataframe based on varying thresholds for indexes

Time:10-27

I have a data frame and a dictionary like this:

thresholds = {'column':{'A':10,'B':11,'C':9}}

df:
    Column
A     13
A     7
A     11
B     12
B     14
B     14
C     7
C     8
C     11

For every index group, I want to calculate the count of values less than the threshold and greater than the threshold value. So my output looks like this:

df:
    Values<Thr  Values>Thr
A       1          2
B       0          3
C       2          1

Can anyone help me with this

CodePudding user response:

You can use:

import numpy as np

t = df.index.to_series().map(thresholds['column'])

out = (pd.crosstab(df.index, np.where(df['Column'].gt(t), 'Values>Thr', 'Values≤Thr'))
         .rename_axis(index=None, columns=None)
       )

Output:

   Values>Thr  Values≤Thr
A           2           1
B           3           0
C           1           2

syntax variant

out = (pd.crosstab(df.index, df['Column'].gt(t))
         .rename_axis(index=None, columns=None)
         .rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
      )

apply on many column based on the key in the dictionary

def count(s):
    t = s.index.to_series().map(thresholds.get(s.name, {}))
    return (pd.crosstab(s.index, s.gt(t))
         .rename_axis(index=None, columns=None)
         .rename(columns={False: 'Values≤Thr', True: 'Values>Thr'})
      )

out = pd.concat({c: count(df[c]) for c in df})

NB. The key of the dictionary must match exactly. I changed the case for the demo.

Output:

          Values≤Thr  Values>Thr
Column A           1           2
       B           0           3
       C           2           1

CodePudding user response:

Here another option:

import pandas as pd
df = pd.DataFrame({'Column': [13, 7, 11, 12, 14, 14, 7, 8, 11]})
df.index = ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']

thresholds = {'column':{'A':10,'B':11,'C':9}}

df['smaller'] = df['Column'].groupby(df.index).transform(lambda x: x < thresholds['column'][x.name]).astype(int)
df['greater'] = df['Column'].groupby(df.index).transform(lambda x: x > thresholds['column'][x.name]).astype(int)

df.drop(columns=['Column'], inplace=True)
# group by index summing the greater and smaller columns
sums = df.groupby(df.index).sum()

sums

enter image description here

  • Related