So I have this dataframe
databackend <- df(Name = c("A1", "A2", "B1", "B2", "B3", "D2", "F6") Comp1 = c("x", NA, NA, NA, NA, NA, NA), Con2 = c(NA, "x", NA, NA, NA, NA, NA), Vis3 = c(NA, NA, "x", NA, NA, NA, NA), Tra4 = c(NA, NA, NA, NA, "x", NA, NA), Pred5 = c(NA, NA, NA, "x", NA, NA, "x"), Adap6 = (NA, NA, NA, NA, NA, "x", NA))
I want to add a column to databackend that displays a value from 1 to 6 based on which column the "x" is on databackend. So the additional column would look like this
databackend$stage = NA
databackend$stage = c(1, 2, 3, 5, 4, 6, 5)
since my dataframe is very large in the original script, I am looking for the fastest (automatic) way to do this. I´ve tried using the for loop but it takes too long.
CodePudding user response:
Relatively simple
> tmp=which(databackend[,-1]=="x",arr.ind=T)
> tmp[order(tmp[,"row"]),"col"]
[1] 1 2 3 5 4 6 5
CodePudding user response:
You can do (assuming as in your example a single "x" in every row):
max.col(!is.na(databackend[-1]))
[1] 1 2 3 5 4 6 5
CodePudding user response:
Using which
and apply
:
apply(databackend[-1], 1, \(x) which(x == "x"))
#[1] 1 2 3 5 4 6 5
A benchmark, max.col
is the fastest:
microbenchmark::microbenchmark(
apply = apply(databackend[-1], 1, \(x) which(x == "x")),
which = {tmp=which(databackend[,-1]=="x",arr.ind=T)
tmp[order(tmp[,"row"]),"col"]},
max.col = max.col(!is.na(databackend[-1]))
)
Unit: microseconds
expr min lq mean median uq max neval
apply 149.4 165.95 232.308 196.20 216.95 2882.4 100
which 118.9 144.35 184.684 158.10 190.45 907.0 100
max.col 51.5 73.00 88.302 79.45 94.40 326.1 100