Suboptimal use of nested for loops in R. Options for vectorization/optimization?-CodePudding

I have a dataset that stores instances for participants vertically over time. They can have basically any number of follow-ups, there are participants with anywhere from 1 to 14 lines, but more are expected to be added with time.

I have a list of variables var that the participants have presumably reported in each follow-up and want to create a new set of "ever" variables vare that describe if at any time before this follow-up, a participant reported "yes" for the corresponding variable.

Here is an example of the desired input/output:

var  = c("var1","var2")
vare = paste0(var,"_ever")

data = data.frame(idno         = c(123,123,123,123,123,123,123)
                  followup_num = c(0,1,2,3,4,5,6)
                  var1         = c(0,NA,0,1,0,NA,1)
                  var2         = c(1,NA,NA,0,0,0,1)
                 )         
data$var1_ever = c(0,0,0,1,1,1,1)
data$var2_ever = c(1,1,1,1,1,1,1)

idno	followup_num	var1	var1_ever	var2	var2_ever
123	0	0	0	1	1
123	1	NA	0	NA	1
123	2	0	0	NA	1
123	3	1	1	0	1
123	4	0	1	0	1
123	5	NA	1	0	1
123	6	1	1	1	1

This is the code I am currently using. Obviously, nested for loops are not ideal in R and this segment of code is particularly slow when handed a few thousand lines.

#For each ID
for (i in unique(data$idno)) {

  id  = data$idno%in%i              #Get the relevant lines for this ID
  fus = sort(data$followup_num[id]) #Get the follow-up numbers
  
  #For each variable in the list
  for (v in seq_along(var)) {

    #Loop through the follow-ups. If you see that the variable reports "yes", mark 
    #  this and every proceeding follow-up as having reported that variable ever 
    #  Otherwise, mark the opposite at that line and move to the next follow-up
    for (f in fus) {
      if (t(data[id & data$followup_num%in%f,var[v]])%in%1) {
        data[id & data$followup_num >= f,vare[v]] = 1
        break
      } else {
        data[id & data$followup_num%in%f,vare[v]] = 0
      }
    }    
  }
}

Is this a problem with an existing solution? Is there a way to optimize/simplify? Is there a use of apply/sapply/etc. functions that I neglected to try?

CodePudding user response：

At its core, the solution is the base function cummax(). We need to take into account NA, so I added replace_na(). And we need to account for additional idno's by using group_by()

A minimal vectorized solution is

df$var1_test<-cummax(x=replace_na(df$var1, 0))

This is a great problem to solve with the tidyverse mutate across function set!

df = data.frame(idno         = c(123,123,123,123,123,123,123),
                  followup_num = c(0,1,2,3,4,5,6),
                  var1         = c(0,NA,0,1,0,NA,1),
                  var2         = c(1,NA,NA,0,0,0,1))

df %>% group_by(idno) %>%  
       arrange(idno, followup_num) %>% 
       mutate(across(.cols=starts_with("var"), 
                     .fns= ~ cummax(tidyr::replace_na(.x, 0)), 
                     .names="{.col}_ever2"))

   idno followup_num  var1  var2 var1_ever2 var2_ever2
1   123            0     0     1          0          1
2   123            1    NA    NA          0          1
3   123            2     0    NA          0          1
4   123            3     1     0          1          1
5   123            4     0     0          1          1
6   123            5    NA     0          1          1
7   123            6     1     1          1          1

Alternativly, if you want to summarize the data to a single row, then a grouped max works

df %>%
  group_by(idno) %>%
  summarise(across(.cols=starts_with("var"), 
                   .fns= ~ max(.x, na.rm=T), 
                   .names="{.col}_ever3"))

   idno var1_ever3 var2_ever3
1   123          1          1

ps. data is an internal function, better to call variable df.

CodePudding user response：

Consider ave cummax (with ifelse to handle NAs):

data <- within(
  data, {
    var2_ever <- ave(var2, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
    var1_ever <- ave(var1, idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
  }
)

For many columns:

vars <- names(data)[grep("var", names(data))]

data[paste0(vars, "_ever")] <- sapply(
  vars, \(var) ave(data[[var]], data$idno, FUN=\(x) cummax(ifelse(is.na(x), 0, x)))
)

idno	followup_num	var1	var1_ever	var2	var2_ever
123	0	0	0	1	1
123	1	NA	0	NA	1
123	2	0	0	NA	1
123	3	1	1	0	1
123	4	0	1	0	1
123	5	NA	1	0	1
123	6	1	1	1	1

idno	followup_num	var1	var1_ever	var2	var2_ever
123	0	0	0	1	1
123	1	NA	0	NA	1
123	2	0	0	NA	1
123	3	1	1	0	1
123	4	0	1	0	1
123	5	NA	1	0	1
123	6	1	1	1	1

idno	followup_num	var1	var1_ever	var2	var2_ever
123	0	0	0	1	1
123	1	NA	0	NA	1
123	2	0	0	NA	1
123	3	1	1	0	1
123	4	0	1	0	1
123	5	NA	1	0	1
123	6	1	1	1	1