I have a datatable / dataframe with some columns of logical vectors - each logical vector represents the presence or absence of a species in a weekly survey. I want to identify those rows where the species is present over a 21 day period or more - the period that the species isn't recorded in between is irrelevant).
This means I need to identify all those rows where a 'TRUE' is present in at least 2 columns (only those columns with "detected" in the colname in example below) that are 3 or more columns apart (because each column represents a weekly survey).
So if e.g. V1 and V4 are TRUE, my 'result' column is TRUE. If only V6,V7,V8 are TRUE, my result column is FALSE. If V2 and V9 are TRUE, my result is TRUE etc.
This is part of a bigger simulation, and the number of columns in the dt (and the number of weekly surveys) varies depending on other simulation parameters, so it isn't possible to do it using column indexing. But the columns of interest can all be given the same suffix ('detected') as in the example code below.
There are also other TRUE/FALSE columns in the dt (that aren't suffixed with 'detected').
example data:
library(data.table)
set.seed(123)
#first column
colref <- 1:20
#number of columns will vary in the dt, so here we generate a random number
n<-sample(5:10,1)
# function to create some logical vectors for the dt
create.col<-function(n){replicate(n,sample(c(TRUE,FALSE),20,replace=TRUE,prob = c(0.2,0.8)),simplify=FALSE)}
# create the dt
dt<-setDT(create.col(n))
# affix "detected" to columns of interest
colnames(dt) <- paste("detected", colnames(dt), sep = "_")
dt<-data.table(colref,dt)
dt
# need to create logical vector 'result' column identifying rows where TRUE is present in columns separated by three or more positions in the dt, with the 'detected' suffix
CodePudding user response:
You could use melt
and rle
:
melt(dt,id.vars="colref")[,.(detect=with(rle(value),
any(!values & lengths>=3)&sum(values)>1))
,by=colref]
colref detect
<int> <lgcl>
1: 1 TRUE
2: 2 TRUE
3: 3 FALSE
4: 4 FALSE
5: 5 FALSE
6: 6 FALSE
7: 7 FALSE
8: 8 TRUE
9: 9 TRUE
10: 10 FALSE
11: 11 FALSE
12: 12 FALSE
13: 13 FALSE
14: 14 FALSE
15: 15 FALSE
16: 16 TRUE
17: 17 FALSE
18: 18 FALSE
19: 19 FALSE
20: 20 FALSE
colref detected
CodePudding user response:
library(tidyverse)
check_three_apart <- function(x) {
# for logical vector x, check if any TRUEs spaced at least 3 indices away
spacing <- which(x) %>% diff()
if (is_empty(spacing)) {
return(FALSE)
} else {
return(any(spacing >= 3))
}
}
dt <- dt %>%
mutate(three_apart = select(., starts_with("detected")) %>% apply(1, check_three_apart))
print(dt)
CodePudding user response:
I get a different result than the selected answer. Perhaps I'm misunderstanding something in the OP's request.
unique(melt(dt, id="colref",measure=patterns("detected"),value.name = "result") %>%
.[, id:=1:.N, by=.(colref)] %>%
.[result==T] %>%
.[order(colref,variable), diff:=id-shift(id), by=.(colref)] %>%
.[diff>=3,.(colref,result)])[dt, on="colref"]
Output:
colref result detected_V1 detected_V2 detected_V3 detected_V4 detected_V5 detected_V6 detected_V7
<int> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
1: 1 TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
2: 2 TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
3: 3 NA FALSE FALSE FALSE FALSE FALSE TRUE TRUE
4: 4 NA FALSE FALSE FALSE TRUE TRUE TRUE FALSE
5: 5 TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
6: 6 NA FALSE FALSE FALSE FALSE TRUE FALSE FALSE
7: 7 NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE
8: 8 TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
9: 9 TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
10: 10 NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
11: 11 NA FALSE FALSE FALSE FALSE FALSE TRUE FALSE
12: 12 NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
13: 13 NA TRUE FALSE FALSE FALSE FALSE FALSE FALSE
14: 14 NA FALSE FALSE FALSE FALSE FALSE FALSE TRUE
15: 15 NA FALSE FALSE FALSE FALSE FALSE TRUE FALSE
16: 16 TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
17: 17 NA TRUE FALSE FALSE FALSE FALSE FALSE FALSE
18: 18 NA TRUE FALSE FALSE FALSE FALSE FALSE FALSE
19: 19 NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE
20: 20 NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE