Identifying and summarizing observations of "A" within +/- 5 minutes of observations of "B"


I have a problem that I just can't figure out how to solve efficiently. I want to identify all observations of sensor data of type "A" that are within +/- 5 minutes of an observation of sensor data of type "B", and do some summarizing of the identified observations. This has to be done for every observation of "B", separately for each individual in the data. The dataset is large, so my for-loop solution is very slow.

I will try to explain it via an example below:

I have timestamped sensor data (two types, "A" and "B") from different individuals.

Individual <- c(rep("Anna", 7), rep("Mark", 8))
datetime <- as.POSIXct(c("2021-11-18 07:02:12","2021-11-18 07:10:25","2021-11-18 07:22:18","2021-11-18 07:24:04","2021-11-18 07:27:43","2021-11-18 07:29:01","2021-11-18 07:50:01","2021-11-21 12:19:28","2021-11-21 10:55:58","2021-11-21 11:14:22","2021-11-21 11:18:05","2021-11-21 11:22:01","2021-11-21 11:26:09","2021-11-21 11:27:38","2021-11-21 12:15:50"), format = "%Y-%m-%d %H:%M:%S")
datatype <- as.factor(c("A","A","A","B","A","A","A","A","A","B","A","B","A","A","A")) # type of sensor data
value <- c(7.85,6.54,7.82,5.43,7.34,8.93,2.54,5.62,7.84,5.56,8.95,6.85,5.88,4.95,8.46) # value of the sensor

df <- data.frame(Individual, datetime, datatype, value)
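
For reference, df then looks like this (note that Mark's rows are not sorted in time):

   Individual            datetime datatype value
1        Anna 2021-11-18 07:02:12        A  7.85
2        Anna 2021-11-18 07:10:25        A  6.54
3        Anna 2021-11-18 07:22:18        A  7.82
4        Anna 2021-11-18 07:24:04        B  5.43
5        Anna 2021-11-18 07:27:43        A  7.34
6        Anna 2021-11-18 07:29:01        A  8.93
7        Anna 2021-11-18 07:50:01        A  2.54
8        Mark 2021-11-21 12:19:28        A  5.62
9        Mark 2021-11-21 10:55:58        A  7.84
10       Mark 2021-11-21 11:14:22        B  5.56
11       Mark 2021-11-21 11:18:05        A  8.95
12       Mark 2021-11-21 11:22:01        B  6.85
13       Mark 2021-11-21 11:26:09        A  5.88
14       Mark 2021-11-21 11:27:38        A  4.95
15       Mark 2021-11-21 12:15:50        A  8.46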

What I am trying to do: for each observation of "B" in the column "datatype", I want to identify all observations of "A" that are within +/- 5 minutes of the "B" observation. I then want to calculate the mean value of these identified "A" observations, as well as the number of "A" observations, and finally add this information to the "B" observation. For example, Anna's single "B" at 07:24:04 has the "A" observations at 07:22:18, 07:27:43 and 07:29:01 within its window, so it gets a mean of (7.82 + 7.34 + 8.93)/3 = 8.03 and a count of 3. All of this has to be done for each observation of "B" in the data, separately for each individual.

I've constructed this really convoluted for-loop that does the trick, but it's extremely slow, and as the original data is several hundred thousand rows it's not really feasible to run (it takes hours...).

library(dplyr)
library(plyr)

df_A <- df %>% filter(datatype == "A")
df_B <- df %>% filter(datatype == "B")

df_list <- list() 
name_list <- list() 
ind<-unique(df_B$Individual)

for (i in 1:length(ind)) { # loop over all individuals in the data
  df_list <- list() # reset per individual, so results from a previous
                    # individual with more "B" rows cannot leak through
  for (j in 1:nrow(df_B[df_B$Individual==ind[i],])) { # loop over each "B" observation of this individual
    # find observations of "A" within +/- 5 min of the "B" observation
    # (explicit difftime units; near(..., tol = 5) silently depends on the
    # unit that the POSIXct subtraction happens to pick)
    row_number <- which(abs(difftime(df_B[df_B$Individual==ind[i],]$datetime[j],
                                     df_A[df_A$Individual==ind[i],]$datetime,
                                     units = "mins")) <= 5)
    # summarize results in a dataframe format
    df1 <- df_B[df_B$Individual==ind[i],][j,]
    mean_for_A <- mean(df_A[df_A$Individual==ind[i],][row_number,]$value) # calc. mean of "A"
    number_of_A <- length(df_A[df_A$Individual==ind[i],][row_number,]$value) # calc. number of "A"
    df2 <- cbind(df1, mean_for_A, number_of_A) # combine into dataframe

    # store dataframe in list
    df_list[[j]] <- df2 # store dataframe of observations of "A" for each "B"
    name_list[[i]] <- ldply(df_list) # store completed dataframes per individual
  }
}

ldply(name_list) # final product, as I want it to be

Is there any way to do this more efficiently?

Thank you!

CodePudding user response:

Here is a data.table approach; it should run much faster than the for-loop. The key piece is the self-join with by = .EACHI, which evaluates the j expression once per row of i, i.e. once per "B" observation.

library(data.table)
# set to data.table format
setDT(df)
# create a unique key (can be removed at the end)
df[, id := .I]
setkey(df, id)
# self-join the subset where datatype == "B" to itself; for each "B" row,
# look up the "A" rows of the same individual whose datetime is within
# +/- 300 seconds
df[datatype == "B", ][df[datatype == "B", ], c("mean_for_A", "number_of_A") := {
  temp <- df[Individual == i.Individual & datatype == "A" &
               datetime >= (i.datetime - 300) & datetime <= (i.datetime + 300), ]
  list(mean(temp$value), nrow(temp))
}, by = .EACHI][, id := NULL][]

   Individual            datetime datatype value mean_for_A number_of_A
1:       Anna 2021-11-18 07:24:04        B  5.43      8.030           3
2:       Mark 2021-11-21 11:14:22        B  5.56      8.950           1
3:       Mark 2021-11-21 11:22:01        B  6.85      7.415           2
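
A non-equi join is an alternative way to express the same lookup; here is a minimal sketch, assuming a data.table version that supports non-equi joins (>= 1.9.8) and using start/end as hypothetical window-column names. It should reproduce the three rows shown above.

library(data.table)
setDT(df)
A <- df[datatype == "A"]
B <- df[datatype == "B"]
# build the +/- 5 minute window around each "B" observation
B[, `:=`(start = datetime - 300, end = datetime + 300)]
# join every "A" row of the same individual that falls inside a window;
# "B" rows without any match come through with A_value = NA (count 0)
res <- A[B, on = .(Individual, datetime >= start, datetime <= end),
         .(Individual, datetime = i.datetime, value = i.value, A_value = x.value)]
# summarise per "B" observation
res[, .(mean_for_A = mean(A_value, na.rm = TRUE),
        number_of_A = sum(!is.na(A_value))),
    by = .(Individual, datetime, value)]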

And here is another data.table approach:

library(data.table)
# set to data.table format
setDT(df)
# split to list, by the datatype column
L <- split(df, by = "datatype")
# left cartesian join: every "A" row paired with every "B" row per individual
ans <- L$A[L$B, on = .(Individual), allow.cartesian = TRUE]
# drop pairs that are more than 300 seconds apart
ans <- ans[abs(as.numeric(datetime) - as.numeric(i.datetime)) <= 300, ]
# summarise per "B" observation
ans[, .(datatype = i.datatype[1], value = i.value[1],
        mean_for_A = mean(value), number_of_A = .N), 
    by = .(Individual, datetime = i.datetime)]

   Individual            datetime datatype value mean_for_A number_of_A
1:       Anna 2021-11-18 07:24:04        B  5.43      8.030           3
2:       Mark 2021-11-21 11:14:22        B  5.56      8.950           1
3:       Mark 2021-11-21 11:22:01        B  6.85      7.415           2
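
One caveat: with allow.cartesian = TRUE the intermediate ans holds every A/B pair per individual before the 300-second filter, so on several hundred thousand rows it is worth checking its size first, e.g.:

# rows and approximate memory of the intermediate cartesian result
nrow(ans)
print(object.size(ans), units = "Mb")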

It depends on your actual dataset which approach runs faster. But on an average system both solutions should take no more than a couple of seconds (a minute at most), assuming you have enough memory (RAM) to hold the datasets.
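
If you want to verify this on your own data, here is a small timing sketch with the microbenchmark package; approach_1() and approach_2() are hypothetical wrappers around the two code blocks above:

library(microbenchmark)

# each hypothetical wrapper should take a fresh copy of df and return the
# summarised result of the corresponding approach above
microbenchmark(
  selfjoin  = approach_1(data.table::copy(df)),
  cartesian = approach_2(data.table::copy(df)),
  times = 10
)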

CodePudding user response:

The data.table package is likely much faster. For your example:

library(data.table)

## create the index start and end times
datB = as.data.table(df[which(df$datatype=="B"),])
datB$start = datB$datetime - (5*60) # subtract 5 min
datB$end = datB$datetime + (5*60) # add 5 min

## create the reference times
datA = as.data.table(df[which(df$datatype=="A"),])
datA$datetime2 = datA$datetime #duplicate to have a start and end time

## set the keys
setkeyv(datA,c("Individual","datetime", "datetime2"))
setkeyv(datB,c("Individual","start", "end"))

## now find the overlaps
overlaps = foverlaps(datA, datB, type="any", nomatch=0L)
 
## and summarise per "B" observation (you may want to build a unique ID
## instead, in case the same Individual/datetime/value combination repeats)
### get the mean times
meanTimes = aggregate(overlaps$i.value ~ paste0(overlaps$Individual,"_",overlaps$datetime,"_",overlaps$value), FUN=mean)
### get the counts
countTimes = aggregate(overlaps$i.value ~ paste0(overlaps$Individual,"_",overlaps$datetime,"_",overlaps$value), FUN=length)

## Put in requested output format
merged = merge(meanTimes, countTimes, by=1)
out = as.data.frame(t(sapply(strsplit(merged[,1], "_"), paste)))
names(out) = c("Individual", "datetime", "value")
out$meanTimes = merged[,2]
out$countTimes = merged[,3]

out

The output is:

  Individual            datetime value meanTimes countTimes
1       Anna 2021-11-18 07:24:04  5.43     8.030          3
2       Mark 2021-11-21 11:14:22  5.56     8.950          1
3       Mark 2021-11-21 11:22:01  6.85     7.415          2
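
As a side note, the aggregate()/merge()/strsplit() steps can also be collapsed into one grouped data.table call on overlaps; after foverlaps(datA, datB) the "A" values are in i.value while datetime and value are the "B" columns, so this should give the same table:

# mean and count of the "A" values, grouped per "B" observation
overlaps[, .(meanTimes = mean(i.value), countTimes = .N),
         by = .(Individual, datetime, value)]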