How do I determine whether time-coded data falls into a time range?

I have been trying to write a for loop to determine whether my time data fall within specific time ranges. I have gone through the related questions on Stack Overflow, and this is where I have gotten so far:

Basically, I have one data frame with acoustic measures of vowels. For each vowel, I also have the time in seconds at which the participants uttered the vowel.

Then I have a second data frame containing time intervals. Those intervals correspond to periods where the participant was talking and there was no overlapping noise. They therefore identify the vowels from my first data frame that can be used in subsequent analyses, because their acoustic measures are not contaminated by other noises.

I need to create a new column ("target") in data frame 1 that indicates, for each participant and each recording, whether or not the vowel falls into one of the intervals from data frame 2.

These are the variables of interest in data frame 1:

    Participant RecordingNumber    time
1        FSO110               1  37.258
2        FSO110               1  37.432
3        FSO110               1  37.496
4        FSO110               1  38.138
5        FSO110               1  38.499
6        FSO110               1  42.124
7        FSO110               1  61.733
8        FSO110               1  61.924
9        FSO110               1  61.980
10       FSO110               1  62.260
11       FSO110               1  62.610
12       FSO110               1  62.943
13       FSO110               1 194.929
14       FSO110               1 195.403
15       FSO110               1 401.114
16       FSO110               1 401.341

These are the variables of interest in data frame 2:

Participant RecordingNumber    tmin    tmax 
FSO110       1                 445.695 447.250   
FSO110       1                 448.444 449.093   
FSO110       1                 452.990 453.292   
FSO110       1                 481.177 481.709   
FSO110       2                 41.202  41.511   
FSO110       2                 42.176  43.132   
FSO110       2                 44.640  47.710   
FSO110       2                 53.819  56.253   
FSO110       2                 113.453 114.803   
FSO110       2                 123.135 123.374

This is what I have so far:

# split dataframes by Participant and Recording Number
data1 <- split(data1, paste0(data1$Participant, data1$RecordingNumber))
data2 <- split(data2, paste0(data2$Participant, data2$RecordingNumber))

# loop through each element of each split df
for (n in seq_along(data1)){
  for (m in seq_along(data2)){
    if (n == m){
      data1[[n]][["target"]] = as.character(lapply(data1[[n]][["time"]], FUN = function(x){
        for (i in 1:nrow(data2[[m]])){
          if (data2[[m]][["tmin"]] <= x & x <= data2[[m]][["tmax"]]){
            return(paste0("in"))}
          else{
            return(paste0("overlap"))}
        }
      }
      ))}}
}

The function seems to work, but only for i == 1 (the first row of data2). It correctly identifies time points from data1 that fall into the first interval of each split element of data2, but it does not go on to check the other intervals.

Solutions I have tried:

Using ifelse() instead of an if statement:

for (n in seq_along(data1)){
  for (m in seq_along(data2)){
    if (n == m){
      data1[[n]][["target"]] = as.character(lapply(data1[[n]][["time"]], FUN = function(x){
        for (i in 1:nrow(data2[[m]])){
          ifelse((data2[[m]][["tmin"]]<=x & x<= data2[[m]][["tmax"]]), "in", "overlap")
        }
      }
      ))}}
}

However, this function returns NULL for each row of my new "target" column.

Adding any() to my if statement:

for (n in seq_along(data_split)){
  for (m in seq_along(data_split_target)){
    if (n == m){
      data_split[[n]][["target"]] = as.character(lapply(data_split[[n]][["time"]], FUN = function(x){
        for (i in 1:nrow(data_split_target[[m]])){
          if (any(data_split_target[[m]][["tmin"]]) <= x & any(x <= data_split_target[[m]][["tmax"]])){
            return(paste0("in"))}
          else{
            return(paste0("overlap"))}
        }
      }
      ))}}
}

Again, the function seems to work, in that it creates a new "target" column with "in" and "overlap" values, but it erroneously returns "in" even when the time point does not fall into any of the intervals.

Can someone help me? Many thanks!

CodePudding user response：

Here is a base R way using split() and Map().
Both data sets are split by participant and recording number, and Map() then applies the function f to each pair of sub-data frames.

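# Split both data frames by participant and recording number;
# drop = TRUE omits combinations that never occur in the data.
grp <- list(measures$Participant, measures$RecordingNumber)
meas_split <- split(measures, grp, drop = TRUE)
int_split <- split(intervals, list(intervals$Participant, intervals$RecordingNumber), drop = TRUE)

# For each time in X, return 1 if it lies inside any interval of Y, else 0.
# Bounds are inclusive, as in the question. Y is NULL when a recording has
# no intervals at all, in which case every time gets 0.
f <- function(X, Y){
  inside <- vapply(X[["time"]], \(x){
    any(x >= Y[["tmin"]] & x <= Y[["tmax"]])
  }, logical(1))
  as.integer(inside)
}

# Pair each group of measures with the intervals of the same name, then
# use unsplit() to put the results back in the original row order.
res <- Map(f, meas_split, int_split[names(meas_split)])
measures$target <- unsplit(res, grp, drop = TRUE)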
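To make the idea concrete, here is a minimal self-contained sketch on toy data. The values are made up and only mimic the columns from the question, and in_any_interval is a hypothetical helper name, not part of the answer above. It tests each time point against all intervals of the same participant and recording with any(), which is the step the loop in the question was missing (there, return() ended the loop on its first pass).

```r
# Toy data with the same columns as in the question (values are made up)
measures <- data.frame(
  Participant     = c("P1", "P1", "P1", "P2"),
  RecordingNumber = c(1, 1, 2, 1),
  time            = c(10.5, 30.0, 7.0, 7.0)
)
intervals <- data.frame(
  Participant     = c("P1", "P1", "P1"),
  RecordingNumber = c(1, 1, 2),
  tmin            = c(10.0, 40.0, 6.0),
  tmax            = c(11.0, 41.0, 8.0)
)

# Hypothetical helper: is one time point inside any interval of its own group?
in_any_interval <- function(p, r, t) {
  grp <- intervals[intervals$Participant == p & intervals$RecordingNumber == r, ]
  # any() over an empty comparison is FALSE, so groups without intervals give FALSE
  any(grp$tmin <= t & t <= grp$tmax)
}

# Apply the helper row by row, then translate TRUE/FALSE into the labels
inside <- mapply(in_any_interval,
                 measures$Participant,
                 measures$RecordingNumber,
                 measures$time,
                 USE.NAMES = FALSE)
measures$target <- ifelse(inside, "in", "overlap")
```

This row-by-row version subsets intervals once per time point, so it is easy to read but may be slow on large data; the split/Map approach avoids repeating that subsetting.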

CodePudding user response：

I solved it using the sqldf package.

library(sqldf)

# Keep every row of data1; tmin and tmax are NA when no interval contains the time
result_all = sqldf("select * from data1
                left join data2
                on data1.rec = data2.rec
                and data1.time between data2.tmin and data2.tmax")

Here, rec is a grouping variable that identifies both the participant and the recording number.
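For readers without sqldf, the same join logic can be sketched in base R with merge(). This is only an illustration under assumptions: the toy values, the way rec is built with paste(), the id helper column, and the target column are mine and do not come from the answer above.

```r
# Toy data (made up) with the columns used in the question
data1 <- data.frame(Participant     = c("P1", "P1", "P2"),
                    RecordingNumber = c(1, 2, 1),
                    time            = c(10.5, 7.0, 7.0))
data2 <- data.frame(Participant     = c("P1", "P1", "P1"),
                    RecordingNumber = c(1, 1, 2),
                    tmin            = c(10.0, 40.0, 6.0),
                    tmax            = c(11.0, 41.0, 8.0))

# Assumed construction of rec; the separator keeps e.g. "P1" + 11 and "P11" + 1 apart
data1$rec <- paste(data1$Participant, data1$RecordingNumber, sep = "_")
data2$rec <- paste(data2$Participant, data2$RecordingNumber, sep = "_")

# Row id so that matches can be traced back to the original rows of data1
data1$id <- seq_len(nrow(data1))

# Pair every time point with every interval of the same rec, then keep the
# ids whose time lies inside an interval (inclusive, like SQL BETWEEN)
pairs <- merge(data1[, c("id", "rec", "time")],
               data2[, c("rec", "tmin", "tmax")],
               by = "rec")
hits <- pairs$id[pairs$time >= pairs$tmin & pairs$time <= pairs$tmax]

data1$target <- ifelse(data1$id %in% hits, "in", "overlap")
```

Because merge() builds every time/interval pair within a rec before filtering, the intermediate table can get large for long recordings.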