This is stumping me with pandas.. I have a data frame with 5.8M rows and a date index. I 5 columns A, B, C, D & E and would simply like to create a new column F_Score based on simple math below:
F_Score=0
if A > = B:
F_Score = 1.0
else:
F_Score= -1.0
if B > = C:
F_Score = F_Score 1.0
else:
F_Score = F_Score - 1.0
if C > = D:
F_Score = F_Score 1.0
else:
F_Score = F_Score-1.0
if D > = E:
F_Score = F_Score 1.0
else:
F_Score = F_Score -1.0
I cannot seem to find a simple way to do this. Any suggestions?
I tried turning all of the columns into numpy arrays and then doing a for loop storing the result into a new array and then creating a column F for the dataframe. This seems very inefficient. Is there a better way?
CodePudding user response:
np.sum(-np.sign(np.diff(df.to_numpy(), axis=1)), axis=1)
From the if chains, it seems like if we take the column-wise successive differences and look at their sign:
- if sign is positive, subtract 1
- if sign is negative, add 1 (note that the difference is taken from right to left).
So we go to the NumPy domain, take the column-wise difference, get the signs, negate them, and their summation is the desired F_score here.
pandas supports diff
too, but since the problem is label-agnostic and you mentioned you have a lot of rows, going to NumPy domain should make it faster, and hopefully fast enough.
CodePudding user response:
You could put your math in a function and apply it with apply: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
should look something like that:
def logic(a,b,c,d,e):
do something...
return f
df["f"] = df.apply(lambda row: logic(row["a"], row["b"], row["c"], row["d"], row["e"]), axis=1)