I'm trying to add a "conditional" column to my dataframe. I can do it with a for loop but I understand this is not efficient. Can my code be simplified and made more efficient? (I've tried masks but I can't get my head around the syntax as I'm a relative newbie to python).
import pandas as pd
path = (r"C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards")
hist_file = r"\x3RC_trnhist.xlsx"
racecard_path = path hist_file
df = pd.read_excel(racecard_path)
df["Mask"] = df["HxFPos"].copy
df["Total"] = df["HxFPos"].copy
cnt = -1
for trn in df["HxRun"]:
cnt = cnt 1
if df.loc[cnt,"HxFPos"] > 6 or df.loc[cnt,"HxTotalBtn"] > 30:
df.loc[cnt,"Mask"] = 0
elif df.loc[cnt,"HxFPos"] < 2 and df.loc[cnt,"HxRun"] < 4 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 4 and df.loc[cnt,"HxRun"] < 9 and df.loc[cnt,"HxTotalBtn"] < 10:
df.loc[cnt,"Mask"] = 1
elif df.loc[cnt,"HxFPos"] < 5 and df.loc[cnt,"HxRun"] < 20 and df.loc[cnt,"HxTotalBtn"] < 20:
df.loc[cnt,"Mask"] = 1
else:
df.loc[cnt,"Mask"] = 0
df.loc[cnt,"Total"] = df.loc[cnt,"Mask"] * df.loc[cnt,"HxFPos"]
df.to_excel(r'C:\Users\chris\Documents\UKHR\PythonSand\PY_Scripts\CleanModules\Racecards\cond_col.xlsx', index = False)
Sample data/output:
HxRun HxFPos HxTotalBtn Mask Total
7 5 8 0 0
13 3 2.75 1 3
12 5 3.75 0 0
11 5 5.75 0 0
11 7 9.25 0 0
11 9 14.5 0 0
10 10 26.75 0 0
8 4 19.5 1 4
8 8 67 0 0
CodePudding user response:
Use df.assign()
for a complex vectorized expression
Use vectorized pandas operators and methods, where possible; avoid iterating. You can do a complex vectorized expression/assignment like this with:
.loc[]
df.assign()
- or alternatively
df.query
(if you like SQL syntax)
or if you insist on doing it by iteration (you shouldn't), you never need to use an explicit for-loop with .loc[]
as you did, you can use:
df.apply(your_function_or_lambda, axis=1)
- or
df.iterrows()
as a fallback
df.assign()
(or df.query
) are going to be less grief when you have long column names (as you do) which get used repreatedly in a complex expression.
Solution with df.assign()
Rewrite your fomula for clarity
When we remove all the unneeded .loc[]
calls your formula boils down to:
HxFPos > 6 or HxTotalBtn > 30:
Mask = 0
HxFPos < 2 and HxRun < 4 and HxTotalBtn < 10:
Mask = 1
HxFPos < 4 and HxRun < 9 and HxTotalBtn < 10:
Mask = 1
HxFPos < 5 and HxFPos < 20 and HxTotalBtn < 20:
Mask = 1
else:
Mask = 0
pandas doesn't have a native case-statement/method.
Renaming your variables HxFPos->f
, HxFPos->r
, HxTotalBtn->btn
for clarity:
(f > 6) or (btn > 30):
Mask = 0
(f < 2) and (r < 4) and (btn < 10):
Mask = 1
(f < 4) and (r < 9) and (btn < 10):
Mask = 1
(f < 5) and (r < 20) and (btn < 20):
Mask = 1
else:
Mask = 0
So really the whole boolean expression for Mask
is gated by (f <= 6) or (btn <= 30)
. (Actually your clauses imply you can only have Mask=1 for (f < 5) and (r < 20) and (btn < 20), if you want to optimize further.)
Mask = ((f<= 6) & (btn <= 30)) & ... you_do_the_rest
Vectorize your expressions
So, here's a vectorized rewrite of your first line. Note that comparisons > and < are vectorized, that the vectorized boolean operators are | and & (instead of 'and', 'or'), and you need to parenthesize your comparisons to get the operator precedence right:
>>> (df['HxFPos']>6) | (df['HxTotalBtn']>30)
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 False
8 True
dtype: bool
Now that output is a logical expression (vector of 8 bools); you can use that directly in df.loc[logical_expression_for_row, 'Mask']
.
Similarly:
((df['HxFPos']<2) & (df['HxRun']<4)) & (df['HxTotalBtn']<10)
CodePudding user response:
Edit - this is where I found an answer: Pandas conditional creation of a series/dataframe column
by @Hossein-Kalbasi
I've just found an answer - please comment if this is not the most efficient.
df.loc[(((df['HxFPos']<3)&(df['HxRun']<5)|(df['HxRun']>4)&(df['HxFPos']<5)&(df['HxRun']<9)|(df['HxRun']>8)&(df['HxFPos']<6)&(df['HxRun']<30))&(df['HxTotalBtn']<30)), 'Mask'] = 1