Need to identify all rows meeting a complex logical condition in R-CodePudding

I have a datatable / dataframe with some columns of logical vectors - each logical vector represents the presence or absence of a species in a weekly survey. I want to identify those rows where the species is present over a 21 day period or more - the period that the species isn't recorded in between is irrelevant).

This means I need to identify all those rows where a 'TRUE' is present in at least 2 columns (only those columns with "detected" in the colname in example below) that are 3 or more columns apart (because each column represents a weekly survey).

So if e.g. V1 and V4 are TRUE, my 'result' column is TRUE. If only V6,V7,V8 are TRUE, my result column is FALSE. If V2 and V9 are TRUE, my result is TRUE etc.

This is part of a bigger simulation, and the number of columns in the dt (and the number of weekly surveys) varies depending on other simulation parameters, so it isn't possible to do it using column indexing. But the columns of interest can all be given the same suffix ('detected') as in the example code below.

There are also other TRUE/FALSE columns in the dt (that aren't suffixed with 'detected').

example data:

library(data.table)

set.seed(123)

#first column
colref <- 1:20

#number of columns will vary in the dt, so here we generate a random number 
n<-sample(5:10,1)

# function to create some logical vectors for the dt
create.col<-function(n){replicate(n,sample(c(TRUE,FALSE),20,replace=TRUE,prob = c(0.2,0.8)),simplify=FALSE)}

# create the dt
dt<-setDT(create.col(n))
# affix "detected" to columns of interest
colnames(dt) <- paste("detected", colnames(dt), sep = "_")
dt<-data.table(colref,dt)
dt

# need to create logical vector 'result' column identifying rows where TRUE is present in columns separated by three or more positions in the dt, with the 'detected' suffix

CodePudding user response：

You could use melt and rle:

melt(dt,id.vars="colref")[,.(detect=with(rle(value), 
                                         any(!values & lengths>=3)&sum(values)>1))                           
                          ,by=colref]

    colref   detect
     <int>   <lgcl>
 1:      1     TRUE
 2:      2     TRUE
 3:      3    FALSE
 4:      4    FALSE
 5:      5    FALSE
 6:      6    FALSE
 7:      7    FALSE
 8:      8     TRUE
 9:      9     TRUE
10:     10    FALSE
11:     11    FALSE
12:     12    FALSE
13:     13    FALSE
14:     14    FALSE
15:     15    FALSE
16:     16     TRUE
17:     17    FALSE
18:     18    FALSE
19:     19    FALSE
20:     20    FALSE
    colref detected

CodePudding user response：

library(tidyverse)

check_three_apart <- function(x) {
  # for logical vector x, check if any TRUEs spaced at least 3 indices away
  spacing <- which(x) %>% diff()
  if (is_empty(spacing)) {
    return(FALSE)
  } else {
    return(any(spacing >= 3))
  }
}

dt <- dt %>%
  mutate(three_apart = select(., starts_with("detected")) %>% apply(1, check_three_apart))
print(dt)

CodePudding user response：

I get a different result than the selected answer. Perhaps I'm misunderstanding something in the OP's request.

unique(melt(dt, id="colref",measure=patterns("detected"),value.name = "result") %>% 
            .[, id:=1:.N, by=.(colref)] %>% 
            .[result==T] %>% 
            .[order(colref,variable), diff:=id-shift(id), by=.(colref)] %>% 
            .[diff>=3,.(colref,result)])[dt, on="colref"]

Output:

    colref result detected_V1 detected_V2 detected_V3 detected_V4 detected_V5 detected_V6 detected_V7
     <int> <lgcl>      <lgcl>      <lgcl>      <lgcl>      <lgcl>      <lgcl>      <lgcl>      <lgcl>
 1:      1   TRUE        TRUE        TRUE       FALSE       FALSE       FALSE        TRUE       FALSE
 2:      2   TRUE        TRUE       FALSE       FALSE        TRUE       FALSE       FALSE       FALSE
 3:      3     NA       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE        TRUE
 4:      4     NA       FALSE       FALSE       FALSE        TRUE        TRUE        TRUE       FALSE
 5:      5   TRUE        TRUE       FALSE       FALSE        TRUE        TRUE       FALSE       FALSE
 6:      6     NA       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE
 7:      7     NA       FALSE       FALSE        TRUE       FALSE       FALSE       FALSE       FALSE
 8:      8   TRUE        TRUE        TRUE       FALSE       FALSE       FALSE        TRUE       FALSE
 9:      9   TRUE       FALSE        TRUE       FALSE       FALSE       FALSE       FALSE        TRUE
10:     10     NA       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE
11:     11     NA       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE
12:     12     NA       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE
13:     13     NA        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE
14:     14     NA       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE
15:     15     NA       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE
16:     16   TRUE       FALSE       FALSE        TRUE       FALSE       FALSE       FALSE        TRUE
17:     17     NA        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE
18:     18     NA        TRUE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE
19:     19     NA       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE
20:     20     NA       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE       FALSE