Home > Back-end >  Pandas New Column from Conditions Across Multiple Columns
Pandas New Column from Conditions Across Multiple Columns

Time:12-29

This is stumping me with pandas.. I have a data frame with 5.8M rows and a date index. I 5 columns A, B, C, D & E and would simply like to create a new column F_Score based on simple math below:

F_Score=0
if A > = B:
    F_Score = 1.0 
else:
    F_Score= -1.0

if B > = C:
    F_Score = F_Score   1.0 
else:
    F_Score = F_Score - 1.0

if C > = D: 
    F_Score = F_Score   1.0 
else:
    F_Score = F_Score-1.0

if D > = E:
    F_Score = F_Score   1.0 
else:
    F_Score = F_Score -1.0

I cannot seem to find a simple way to do this. Any suggestions?

I tried turning all of the columns into numpy arrays and then doing a for loop storing the result into a new array and then creating a column F for the dataframe. This seems very inefficient. Is there a better way?

CodePudding user response:

np.sum(-np.sign(np.diff(df.to_numpy(), axis=1)), axis=1)

From the if chains, it seems like if we take the column-wise successive differences and look at their sign:

  • if sign is positive, subtract 1
  • if sign is negative, add 1 (note that the difference is taken from right to left).

So we go to the NumPy domain, take the column-wise difference, get the signs, negate them, and their summation is the desired F_score here.

pandas supports diff too, but since the problem is label-agnostic and you mentioned you have a lot of rows, going to NumPy domain should make it faster, and hopefully fast enough.

CodePudding user response:

You could put your math in a function and apply it with apply: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html

should look something like that:

def logic(a,b,c,d,e):
    do something...
    return f

df["f"] = df.apply(lambda row: logic(row["a"], row["b"], row["c"], row["d"], row["e"]), axis=1)
  • Related