I have a dataframe with 3 columns: a_id, b, c (with a_id as a unique key), and I would like to assign a score to each row based on the values in columns b and c. I have created the following functions:
def b_score_function(df):
    if df['b'] <= 0:
        return 0
    elif df['b'] <= 2:
        return 0.25
    else:
        return 1

def c_score_function(df):
    if df['c'] <= 0:
        return 0
    elif df['c'] <= 1:
        return 0.5
    else:
        return 1
Normally, I would use something like this:
df['b_score'] = df.apply(b_score_function, axis=1)
df['c_score'] = df.apply(c_score_function, axis=1)
However, this approach gets too long when there are many columns. How can I loop over the selected columns instead? I have tried the following:
ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    df[f'{col}_score'] = df.apply(f'{col}_score_function', axis=1)
but it raised the following error:
'b_score_function' is not a valid function for 'DataFrame' object
Can anyone please point out what I did wrong? Also, if anyone can suggest how to make this reusable, that would be appreciated.
Thank you.
CodePudding user response:
For a vectorized approach that handles all columns in a single shot, you can use dictionaries to hold the threshold and replacement values, then numpy.select:
import numpy as np
import pandas as pd

# example input
df = pd.DataFrame({'b': [-1, 2, 5],
                   'c': [5, -1, 1]})

# dictionaries (one key:value per column)
thresh = {'b': 2, 'c': 1}
repl = {'b': 0.25, 'c': 0.5}

# the first matching condition wins; 1 is the default when neither matches
out = pd.DataFrame(
    np.select([df.le(0), df.le(thresh)],
              [0, pd.Series(repl)],
              1),
    columns=list(thresh),
    index=df.index
).add_suffix('_score')
output:
   b_score  c_score
0     0.00      1.0
1     0.25      0.0
2     1.00      0.5
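Since the question also asks for something reusable, here is a rough sketch of the same numpy.select idea wrapped in a function and joined back onto the original frame so that a_id is kept. The name score_columns and its arguments are my own for illustration, not part of pandas or of the answer above, and it assumes every column follows the same "0 / middle value / 1" pattern:

```python
import numpy as np
import pandas as pd


def score_columns(df, thresh, repl):
    """Hypothetical helper: 0 if value <= 0, repl[col] if value <= thresh[col], else 1."""
    cols = list(thresh)
    values = df[cols]
    upper = pd.Series(thresh)[cols]          # per-column upper thresholds
    mid = pd.Series(repl)[cols].to_numpy()   # per-column middle scores
    scores = np.select(
        [values.le(0).to_numpy(), values.le(upper, axis=1).to_numpy()],
        [0.0, mid],
        default=1.0,
    )
    return pd.DataFrame(scores, columns=cols, index=df.index).add_suffix("_score")


df = pd.DataFrame({"a_id": [10, 11, 12], "b": [-1, 2, 5], "c": [5, -1, 1]})
scored = df.join(score_columns(df, {"b": 2, "c": 1}, {"b": 0.25, "c": 0.5}))
print(scored)
```

Adding another column then only means adding one entry to each dictionary.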
CodePudding user response:
IIUC, this should work for you:
import pandas as pd

df = pd.DataFrame({'a_id': range(5), 'b': [0.0, 0.25, 0.5, 2.0, 2.5], 'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

def b_score_function(df):
    if df['b'] <= 0:
        return 0
    elif df['b'] <= 2:
        return 0.25
    else:
        return 1

def c_score_function(df):
    if df['c'] <= 0:
        return 0
    elif df['c'] <= 1:
        return 0.5
    else:
        return 1

ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    # eval turns the name string into the function object defined above
    df[f'{col}_score'] = df.apply(eval(f'{col}_score_function'), axis=1)

print(df)
Result:
   a_id     b     c  b_score  c_score
0     0  0.00  0.00     0.00      0.0
1     1  0.25  0.25     0.25      0.5
2     2  0.50  0.50     0.25      0.5
3     3  2.00  1.00     0.25      0.5
4     4  2.50  1.50     1.00      1.0
CodePudding user response:
The problem with your attempt is that pandas cannot look up your functions from strings that happen to match their names. For example, you need to pass df.apply(b_score_function, axis=1), and not df.apply("b_score_function", axis=1) (note the double quotes).
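To make the difference concrete, here is a small sketch with made-up data. The function object works, while the string version fails, because pandas tries to resolve the string itself rather than searching your namespace. I catch a broad Exception here only because I would not rely on the exact exception class across pandas versions:

```python
import pandas as pd

df = pd.DataFrame({"a_id": [1, 2, 3], "b": [-1, 2, 5]})


def b_score_function(row):
    if row["b"] <= 0:
        return 0
    elif row["b"] <= 2:
        return 0.25
    return 1


# Passing the function object itself works.
ok = df.apply(b_score_function, axis=1)
print(ok.tolist())

# Passing the function's name as a string does not.
try:
    df.apply("b_score_function", axis=1)
    string_failed = False
except Exception as exc:
    string_failed = True
    print(type(exc).__name__, exc)
```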
My first thought would be to link the column names to functions with a dictionary:
funcs = {'b': b_score_function,
         'c': c_score_function}

for col in ds_cols:
    foo = funcs[col]
    df[f'{col}_score'] = df.apply(foo, axis=1)
Typing out the dictionary funcs may be tedious or infeasible depending on how many columns/functions you have. If that is the case, you may have to find additional ways to automate the creation and access of your column-specific functions.
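One possible way to automate the creation, assuming all of your scoring functions share the same "0 / middle value / 1" shape as in the question, is a small factory that builds each function from a pair of numbers. The names make_score_function and rules are hypothetical, just for this sketch:

```python
import pandas as pd


def make_score_function(col, upper, mid_score):
    """Build a row-wise scorer: 0 if row[col] <= 0, mid_score if <= upper, else 1."""
    def score(row):
        value = row[col]
        if value <= 0:
            return 0.0
        elif value <= upper:
            return mid_score
        return 1.0
    return score


# Hypothetical rule table: column -> (upper threshold, middle score)
rules = {"b": (2, 0.25), "c": (1, 0.5)}
funcs = {col: make_score_function(col, upper, mid)
         for col, (upper, mid) in rules.items()}

df = pd.DataFrame({"a_id": range(3), "b": [-1, 2, 5], "c": [5, -1, 1]})
for col, func in funcs.items():
    df[f"{col}_score"] = df.apply(func, axis=1)
print(df)
```

This keeps the per-column numbers in one place, so the dictionary of functions never has to be typed out by hand.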
One somewhat automatic way is to use locals() or globals(). These return dictionaries that contain the functions you defined (as well as other things):
for col in ds_cols:
    key = f"{col}_score_function"
    foo = locals()[key]  # look up the function object by its name
    df[f'{col}_score'] = df.apply(foo, axis=1)
This code depends on the function for column "X" being called X_score_function(), but that seems to be the case in your example. It also requires that every column in ds_cols has a corresponding entry in locals(); otherwise the lookup raises a KeyError.
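If some columns might not have a matching function, a slightly more defensive sketch is to use .get() on the namespace and skip (or report) the columns that have no scorer. The extra column d and the skipped list are made up for illustration:

```python
import pandas as pd


def b_score_function(row):
    return 0 if row["b"] <= 0 else 0.25 if row["b"] <= 2 else 1


def c_score_function(row):
    return 0 if row["c"] <= 0 else 0.5 if row["c"] <= 1 else 1


df = pd.DataFrame({"a_id": range(3), "b": [-1, 2, 5], "c": [5, -1, 1], "d": [0, 1, 2]})
ds_cols = df.columns.difference(["a_id"]).to_list()

# Merge both dictionaries so the lookup works at module level or inside a function.
namespace = {**globals(), **locals()}

skipped = []
for col in ds_cols:
    func = namespace.get(f"{col}_score_function")
    if not callable(func):
        skipped.append(col)  # no X_score_function defined for this column
        continue
    df[f"{col}_score"] = df.apply(func, axis=1)

print(df)
print("no scoring function for:", skipped)
```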
Somewhat confusingly, there are some functions you can access by passing a string to apply, but these are only names that pandas can resolve itself, such as DataFrame methods and NumPy functions, like df.apply('sum') or df.apply('mean'). Documentation for this appears to be absent. Generally you would want to do df.sum() rather than df.apply('sum'), but sometimes being able to access the method by its string name is convenient.
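As a quick check of the string shortcut, with made-up data, the string form and the method form give the same result:

```python
import pandas as pd

df = pd.DataFrame({"b": [-1, 2, 5], "c": [5, -1, 1]})

by_string = df.apply("sum")   # resolved by pandas to the built-in sum
by_method = df.sum()          # the usual, more direct spelling
print(by_string.equals(by_method))

# The string form is handy when the aggregation name comes from configuration.
for name in ["sum", "mean"]:
    print(name, df.apply(name).to_dict())
```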