What is the fastest way to add new column based on dataframe entries in specific columns-CodePudding

So I have this dataframe

databackend <- df(Name = c("A1", "A2", "B1", "B2", "B3", "D2", "F6") Comp1 = c("x", NA, NA, NA, NA, NA, NA), Con2 = c(NA, "x", NA, NA, NA, NA, NA), Vis3 = c(NA, NA, "x", NA, NA, NA, NA), Tra4 = c(NA, NA, NA, NA, "x", NA, NA), Pred5 = c(NA, NA, NA, "x", NA, NA, "x"), Adap6 = (NA, NA, NA, NA, NA, "x", NA))

I want to add a column to databackend that displays a value from 1 to 6 based on which column the "x" is on databackend. So the additional column would look like this

databackend$stage = NA
databackend$stage = c(1, 2, 3, 5, 4, 6, 5)

since my dataframe is very large in the original script, I am looking for the fastest (automatic) way to do this. I´ve tried using the for loop but it takes too long.

CodePudding user response：

Relatively simple

> tmp=which(databackend[,-1]=="x",arr.ind=T)
> tmp[order(tmp[,"row"]),"col"]
[1] 1 2 3 5 4 6 5

CodePudding user response：

You can do (assuming as in your example a single "x" in every row):

max.col(!is.na(databackend[-1]))

[1] 1 2 3 5 4 6 5

CodePudding user response：

Using which and apply:

apply(databackend[-1], 1, \(x) which(x == "x"))
#[1] 1 2 3 5 4 6 5

A benchmark, max.col is the fastest:

microbenchmark::microbenchmark(
  apply = apply(databackend[-1], 1, \(x) which(x == "x")),
  which = {tmp=which(databackend[,-1]=="x",arr.ind=T)
  tmp[order(tmp[,"row"]),"col"]},
  max.col = max.col(!is.na(databackend[-1]))
)

Unit: microseconds
    expr   min     lq    mean median     uq    max neval
   apply 149.4 165.95 232.308 196.20 216.95 2882.4   100
   which 118.9 144.35 184.684 158.10 190.45  907.0   100
 max.col  51.5  73.00  88.302  79.45  94.40  326.1   100