Iterate a function using apply over similarly named columns

I have a dataframe with 3 columns: a_id, b, c (with a_id as a unique key), and I would like to assign a score to each row based on the values in columns b and c. I have written the following functions:

def b_score_function(df):
    if df['b'] <= 0 :
        return 0
    elif df['b'] <= 2 :
        return 0.25
    else: 
        return 1

def c_score_function(df): 
    if df['c'] <= 0 :
        return 0
    elif df['c'] <= 1 :
        return 0.5
    else: 
        return 1

Normally, I would use something like this:

df['b_score'] = df.apply(b_score_function, axis=1)
df['c_score'] = df.apply(c_score_function, axis=1)

However, this approach becomes tedious when there are many columns. How can I loop over the selected columns instead? I have tried the following:

ds_cols = df.columns.difference(['a_id']).to_list() 

for col in ds_cols:
    df[f'{col}_score'] = df.apply(f'{col}_score_function', axis = 1)

but it raised the following error:

'b_score_function' is not a valid function for 'DataFrame' object

Can anyone please point out what I did wrong? Also, any suggestions on how to make this reusable would be appreciated.

Thank you.

CodePudding user response：

For a vectorized approach in a single step, you can hold the threshold and replacement values in dictionaries, then use numpy.select:

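import numpy as np
import pandas as pd

# example input
df = pd.DataFrame({'b': [-1, 2, 5],
                   'c': [5, -1, 1]})

# dictionaries (one key:value per column)
# thresh: upper bound of the middle band; repl: score assigned within that band
thresh = {'b': 2, 'c': 1}
repl = {'b': 0.25, 'c': 0.5}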

out = pd.DataFrame(
    np.select([df.le(0), df.le(thresh)],
              [0, pd.Series(repl)],
              1),
    columns=list(thresh),
    index=df.index
).add_suffix('_score')

output:

   b_score  c_score
0     0.00      1.0
1     0.25      0.0
2     1.00      0.5
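The example above omits the a_id key from the question. As a minimal sketch of how the same idea might be applied to a frame that still carries a_id, the scored columns can be selected first and the result joined back. The a_id values and the names values, scores, and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input that still carries the a_id key from the question.
df = pd.DataFrame({'a_id': [10, 11, 12],
                   'b': [-1, 2, 5],
                   'c': [5, -1, 1]})

# Per-column upper bound of the middle band, and the score given inside it.
thresh = pd.Series({'b': 2, 'c': 1})
repl = pd.Series({'b': 0.25, 'c': 0.5})

# Compare only the scored columns so a_id never enters the comparison.
values = df[thresh.index]

scores = pd.DataFrame(
    np.select([values.le(0).to_numpy(),
               values.le(thresh, axis='columns').to_numpy()],
              [0, repl.to_numpy()],
              1),
    columns=thresh.index,
    index=df.index,
).add_suffix('_score')

# Attach the score columns to the original frame (aligned on the index).
result = df.join(scores)
print(result)
```

Adding a column to the scoring then only requires a new entry in thresh and repl.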

CodePudding user response：

IIUC, this should work for you:

import pandas as pd

df = pd.DataFrame({'a_id': range(5), 'b': [0.0, 0.25, 0.5, 2.0, 2.5], 'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

def b_score_function(df):
    if df['b'] <= 0 :
        return 0
    elif df['b'] <= 2 :
        return 0.25
    else: 
        return 1

def c_score_function(df): 
    if df['c'] <= 0 :
        return 0
    elif df['c'] <= 1 :
        return 0.5
    else: 
        return 1


ds_cols = df.columns.difference(['a_id']).to_list() 
for col in ds_cols:
    # eval turns the name string into the function object; only use it with trusted names
    df[f'{col}_score'] = df.apply(eval(f'{col}_score_function'), axis = 1)
print(df)

Result:

   a_id     b     c  b_score  c_score
0     0  0.00  0.00     0.00      0.0
1     1  0.25  0.25     0.25      0.5
2     2  0.50  0.50     0.25      0.5
3     3  2.00  1.00     0.25      0.5
4     4  2.50  1.50     1.00      1.0

CodePudding user response：

The problem with your attempt is that pandas does not look up your own functions from a string containing their name. You need to pass the function object itself, df.apply(b_score_function, axis=1), and not a string, df.apply("b_score_function", axis=1) (note the double quotes).

My first thought would be to link the column names to functions with a dictionary:

funcs = {'b' : b_score_function,
         'c' : c_score_function}

for col in ds_cols:
    foo = funcs[col]
    df[f'{col}_score'] = df.apply(foo, axis = 1)

Typing out the dictionary funcs may be tedious or infeasible depending on how many columns/functions you have. If that is the case, you may have to find additional ways to automate the creation and access of your column-specific functions.
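One possible way to automate this, assuming every column follows the same three-band pattern as in the question (0 at or below zero, a middle score up to some bound, 1 above it), is to generate the functions from a small table of rules. The helper make_score_function and the rules dictionary are hypothetical names introduced for this sketch.

```python
import pandas as pd

def make_score_function(col, upper, mid_score):
    """Build a row-wise scorer for one column (hypothetical helper)."""
    def score(row):
        if row[col] <= 0:
            return 0
        elif row[col] <= upper:
            return mid_score
        return 1
    return score

# Per column: (upper bound of the middle band, score inside that band).
rules = {'b': (2, 0.25), 'c': (1, 0.5)}
funcs = {col: make_score_function(col, upper, mid)
         for col, (upper, mid) in rules.items()}

df = pd.DataFrame({'a_id': range(5),
                   'b': [0.0, 0.25, 0.5, 2.0, 2.5],
                   'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

# Apply each generated function to produce the matching score column.
for col, func in funcs.items():
    df[f'{col}_score'] = df.apply(func, axis=1)
print(df)
```

With this layout, supporting another column means adding one entry to rules rather than writing another function by hand.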

One somewhat automatic way is to use locals() or globals() - these will return dictionaries which have the functions you defined (as well as other things):

for col in ds_cols:
    key = f"{col}_score_function"
    foo = locals()[key]  # look up the function object by its name
    df[f'{col}_score'] = df.apply(foo, axis=1)

This code relies on the function for column "X" being named X_score_function(), which holds in your example. It also requires every column in ds_cols to have a corresponding entry in locals(); otherwise the lookup raises a KeyError.
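If some columns might not have a matching function, the lookup can be made tolerant by using .get() and checking the result. This is only a sketch: the sample data, the namespace and missing names, and the choice to skip such columns rather than raise are assumptions made for illustration.

```python
import pandas as pd

def b_score_function(row):
    if row['b'] <= 0:
        return 0
    elif row['b'] <= 2:
        return 0.25
    return 1

# Column c deliberately has no scoring function in this sketch.
df = pd.DataFrame({'a_id': range(3),
                   'b': [0.0, 1.0, 3.0],
                   'c': [0.0, 0.5, 1.5]})

namespace = locals()  # at module level this is the same dictionary as globals()
missing = []          # columns for which no scoring function was found

for col in df.columns.difference(['a_id']):
    func = namespace.get(f'{col}_score_function')
    if callable(func):
        df[f'{col}_score'] = df.apply(func, axis=1)
    else:
        missing.append(col)

print(missing)
```

Collecting the skipped columns in a list makes it easy to report them afterwards instead of failing on the first one.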

Somewhat confusingly, there are some functions which you can access by passing a string to apply, but these are only names that pandas can resolve on its own, such as DataFrame methods or NumPy functions, like df.apply('sum') or df.apply('mean'). Documentation for this appears to be absent. Generally you would want to do df.sum() rather than df.apply('sum'), but sometimes being able to access the method by the string is convenient.
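The difference can be seen in a short experiment. This is a sketch with made-up data; the exact exception type and message may vary between pandas versions, so the code catches a broad Exception and simply prints what was raised.

```python
import pandas as pd

df = pd.DataFrame({'b': [-1, 2, 5], 'c': [5, -1, 1]})

# A string that names a DataFrame method is resolved by pandas itself.
by_string = df.apply('sum')
by_method = df.sum()
print(by_string.equals(by_method))

def b_score_function(row):
    return 0 if row['b'] <= 0 else 1

# A string that names a user-defined function is not looked up.
lookup_failed = False
try:
    df.apply('b_score_function', axis=1)
except Exception as exc:  # exception type may differ across pandas versions
    lookup_failed = True
    print(type(exc).__name__, exc)

# Passing the function object works as expected.
scores = df.apply(b_score_function, axis=1)
print(scores.tolist())
```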