Function that returns proportion of values exceeding threshold with several variables-CodePudding

Consider the dataframe below, where there are several variables, each with the same number of values (in this case, 4).

var1       var2    var3    var4 ...
1         10       -5      35   ...
2         11       -6      36   ...
3         12       -7      45   ...
4         13       -8      -6   ...

I would like to create a function that returns the proportion of values that are greater/less than the specified threshold values for several variables. The main goal is to create a function with the ability to enter however many variables, their corresponding operator ('<' or '>'), and their corresponding threshold values as desired.

For example, I was thinking of a function with the format (it doesn't have to be this format):

def proportion(variables,operators,thresholds):

where

proportion([var1,var2],['>','>'],[2, 10])

would yield 2/4, or 0.5, since the last two row values in the dataframe both exceed 2 for var1 and 10 for var2.

I've been stuck on this as I have no idea how to create such a function without hardcoding the function to take in a specific number of variables. I've attempted to use a for loop, which clearly isn't correct and not heading towards the right direction.

def proportion(variables,operators,thresholds):
   x=np.arange(len(variables))
   for i in x:
      if operators[i] == '>':
         return sum(x2 > thresholds[i] for x2 in df[variables[i]])/4
      else:
         return sum(x2 < thresholds[i] for x2 in df[variables[i]])/4

CodePudding user response：

We can create a function using the signature you specify, however instead of using loops we will use vectorized operations on each column and then use boolean conditions to filter down to the results we want. Explanations of what is happening is captured as comments in the answer.

import pandas as pd

# We create an operations map in order to map the function input to
# the attribute of a Series. Both DataFrames and Series have gt, lt, and eq
# operations that we can utilize which vectorize the operation for us
# instead of looping over every row.
op_map = {">": "gt", "<": "lt", "=": "eq"}

def proportions(variables, ops, thresholds, df):
    # zip the inputs in to a list of tuples where the format of
    # each item looks like `("var1", "<", 1)`
    constraints = zip(variables, ops, thresholds)
    results = []
    # iterating over our zipped tuples that we created above
    for constraint in constraints:
        v, op, t = constraint
        # map the operation input to the attribute name
        attr = op_map[op]
        # grab the relevant column from the DataFrame
        series = df[v]
        # grab the function from the Series we just created
        fn = getattr(series, attr)
        # find which rows meet the criteria and mark them True
        # otherwise they get marked as False
        result = fn(t)
        # append the final results to a list
        results.append(result)
    # concat all the columns together and sum up where every row in
    # the DataFrame is True, i.e all the thresholds were met
    total = pd.concat(results, axis=1).all(axis="columns").sum()
    # obtain the final answer by dividing by the length of the DataFrame
    return total / len(df)

proportions(["var1", "var2"], [">", ">"], [2, 10], df)
> 0.5

proportions(["var1", "var2", "var3"], [">", ">", "="], [2, 10, -7], df)
> 0.25

CodePudding user response：

This function uses eval to check if the statement is true.

def proportion(variables,operators,thresholds):
    lst = []
    n = len(variables)
    for v in variables:
        y = True
        for x in range(n):
            var = v[x]
            op = operators[x]
            tr = thresholds[x]
            z = eval(f'{var} {op} {tr}')
            if y:
                y = z
        lst.append(y)
    return lst.count(True) / n

var1 = [1, 2, 3, 4]
var2 = [10, 11, 12, 13]

print(proportion([var1,var2],['>','>'],[2, 10]))

Output: 0.5