R: apply a function to all rows that orders a subset and completes a calculation if a condition is met

I want to apply a function to every row: order a subset of that row's values, run a calculation that depends on an if/else condition, and store the result in a new column of the data frame. I have over two million rows, so a for loop is very inefficient.

given the following dataframe: (edited for r2evans)

df <- as.data.frame(cbind(matrix(LETTERS[1:3], ncol=1), matrix(sample.int(100, size=15), nrow=3, ncol=5)))

> df
  V1 V2 V3 V4 V5 V6
1  A 77 79 32  5  4
2  B 57 24 68 65 45
3  C 66 60 82 74 22

Is there a way to do the following calculation for each row without a for loop? I actually have over 2630800 rows.

df$num <- 0
df[2:7] <- sapply(df[2:7],as.numeric)
names(df) <- c("first_name", "sec", "A", "B", "C", "D", "num")

the names of the columns are required for the if statement below: (also edited to only sort once)

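for (i in seq_len(nrow(df))) {
    # unlist() turns the one-row data frame into a named numeric vector before sorting
    row = sort(unlist(df[i, 3:6]))
    if (df[i, 1] == names(row)[4]) {
        # first_name is the column with the largest value: use the second-largest
        df$num[i] = row[3] / (row[3] + row[4])
    } else {
        # otherwise use the largest value
        df$num[i] = row[4] / (row[3] + row[4])
    }
}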

such that i get this outcome:

> df
  first_name sec  A  B  C  D       num
1          A  77 79 32  5  4 0.2882883
2          B  57 24 68 65 45 0.4887218
3          C  66 60 82 74 22  0.525641

I'm not sure how to do this with apply. I was thinking of something like the following, but it does not work, and I am not sure how to incorporate the if/else condition:

df$num <- apply(df, 1, function(x) unlist(x[3:6][order(x[3:6])][3]/(x[3:6][order(x[3:6])][3] + x[3:6][order(x[3:6])][4])))
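
One likely reason the apply() attempt fails, separate from the if/else question: apply() converts the data frame to a matrix first, and because first_name is character, every value in x arrives as a string, so the division cannot work. Below is a minimal sketch with the values printed above hard-coded. The name ratio_no_condition is only for illustration. Restricting apply() to the numeric columns avoids the coercion, although it still ignores the if/else.

```r
# Values copied from the printed data frame in the question
df <- data.frame(
  first_name = c("A", "B", "C"),
  sec = c(77, 57, 66),
  A = c(79, 24, 60),
  B = c(32, 68, 82),
  C = c(5, 65, 74),
  D = c(4, 45, 22)
)

# apply() works on as.matrix(df); one character column makes every cell character
unique(apply(df, 1, class))
# [1] "character"

# Restricting apply() to the numeric columns keeps x numeric
ratio_no_condition <- apply(df[, 3:6], 1, function(x) {
  s <- sort(x)                   # ascending, so s[3] and s[4] are the two largest
  unname(s[3] / (s[3] + s[4]))   # second-largest over the sum of the top two
})
ratio_no_condition
```

Without the condition, row 3 comes out as 74/156 instead of the desired 82/156, which is why the if/else still has to be handled somewhere.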

CodePudding user response：

Here's a little brute-force approach with some hard-coding (column indices).

cols <- 3:6
sorted2 <- t(apply(df[,cols], 1, sort, decreasing = TRUE))[,1:2]
sorted2
#   [,1] [,2]
# 1   79   32
# 2   68   65
# 3   82   74
df$num <- ifelse(df[,1] == names(df)[cols][max.col(df[,cols])],
                 sorted2[,2], sorted2[,1]) /
            rowSums(sorted2)
df
#   first_name sec  A  B  C  D       num
# 1          A  77 79 32  5  4 0.2882883
# 2          B  57 24 68 65 45 0.4887218
# 3          C  66 60 82 74 22 0.5256410

Notes:

I define cols so that if you change the columns, you only need to change it once.
Your code appears to check whether first_name matches the name of the column holding the row's maximum, use that to pick either the second-largest or the largest value from the cols columns, and then divide that value by the sum of the two largest. For this, I calculate sorted2, which always has the max in column 1 and the 2nd-max in column 2, making rowSums and max/2nd-max access immediate/easy.
I could have used df$first_name instead of df[,1], your choice.
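
To make the second note concrete, here is a small sketch of the condition on its own, using the three rows from the question hard-coded. The names top_name and use_second are mine, and ties.method = "first" is an assumption: max.col() breaks ties at random by default, which only matters if a row can contain two equal maxima.

```r
# Values copied from the printed data frame in the question
df <- data.frame(
  first_name = c("A", "B", "C"),
  sec = c(77, 57, 66),
  A = c(79, 24, 60),
  B = c(32, 68, 82),
  C = c(5, 65, 74),
  D = c(4, 45, 22)
)
cols <- 3:6

# Name of the column that holds each row's maximum
top_name <- names(df)[cols][max.col(df[, cols], ties.method = "first")]
top_name
# [1] "A" "B" "B"

# TRUE where first_name is that column, i.e. where the 2nd-max is the numerator
use_second <- df$first_name == top_name
use_second
# [1]  TRUE  TRUE FALSE
```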

This should work much better than a for loop or sapply: apart from the row-wise sort inside apply(), every step (max.col, ifelse, rowSums) operates on whole columns at once.
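
If the row-wise sort ever becomes the bottleneck at a couple of million rows, the two largest values per row can also be pulled out with max.col() and matrix indexing alone. This is a sketch of an alternative, not the code above: m, m2, idx_max, top1 and top2 are names I made up, ties.method = "first" is an assumption about how ties should be broken, and it assumes there are no NA values in the cols columns.

```r
# Values copied from the printed data frame in the question
df <- data.frame(
  first_name = c("A", "B", "C"),
  sec = c(77, 57, 66),
  A = c(79, 24, 60),
  B = c(32, 68, 82),
  C = c(5, 65, 74),
  D = c(4, 45, 22)
)
cols <- 3:6
m <- as.matrix(df[, cols])
rows <- seq_len(nrow(m))

# Position and value of each row's maximum
idx_max <- max.col(m, ties.method = "first")
top1 <- m[cbind(rows, idx_max)]

# Mask the maximum with -Inf, then take the maximum again to get the 2nd-max
m2 <- m
m2[cbind(rows, idx_max)] <- -Inf
top2 <- m2[cbind(rows, max.col(m2, ties.method = "first"))]

# Same rule as the loop: 2nd-max on top if first_name names the max column
is_match <- df$first_name == colnames(m)[idx_max]
df$num <- ifelse(is_match, top2, top1) / (top1 + top2)
df$num
# matches the expected num column: 0.2882883 0.4887218 0.5256410
```

Whether this is actually faster than the apply() version depends on the data, so it is worth timing both on a sample of the real rows before switching.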