I have been trying to write a for loop in order to determine whether my time data falls within specific time ranges. I have gone through all related questions on stack overflow and so far this is where I have gotten:
Basically, I have one data frame with acoustic measures of vowels. For each vowel, I also have the time in seconds at which the participants uttered the vowel.
Then I have a second dataframe including time intervals. Those intervals correspond to time periods where the participant was talking and there was no overlapping noise. Those intervals therefore identify the vowels from my first dataframe that can be used in subsequent analyses because their acoustic measures are not contaminated by other noises
I need to create a new column ("target") in data frame 1 that indicates, for each participant and for each recording, whether YES or NO the vowel falls into one of the intervals from data frame 2.
these are the variables of interest in data frame 1:
Participant RecordingNumber time
1 FSO110 1 37.258
2 FSO110 1 37.432
3 FSO110 1 37.496
4 FSO110 1 38.138
5 FSO110 1 38.499
6 FSO110 1 42.124
7 FSO110 1 61.733
8 FSO110 1 61.924
9 FSO110 1 61.980
10 FSO110 1 62.260
11 FSO110 1 62.610
12 FSO110 1 62.943
13 FSO110 1 194.929
14 FSO110 1 195.403
15 FSO110 1 401.114
16 FSO110 1 401.341
these are the variables of interest in data frame 2:
Participant RecordingNumber tmin tmax
FSO110 1 445.695 447.250
FSO110 1 448.444 449.093
FSO110 1 452.990 453.292
FSO110 1 481.177 481.709
FSO110 2 41.202 41.511
FSO110 2 42.176 43.132
FSO110 2 44.640 47.710
FSO110 2 53.819 56.253
FSO110 2 113.453 114.803
FSO110 2 123.135 123.374
So far, I have gotten there:
# split dataframes by Participant and Recording Number
data1 <- split(data1, paste0(data1$Participant, data1$RecordingNumber))
data2 <- split(data2, paste0(data2$Participant, data2$RecordingNumber))
# loop through each element of each splitted df
for (n in seq_along(data1)){
for (m in seq_along(data2)){
if(n == m){
data_split[[n]][["target"]] = as.character(lapply(data1[[n]][["time"]], FUN = function(x){
for (i in 1:nrow(data2[[m]])){
if(data2[[m]][["tmin"]]<=x & x<= data2[[m]][["tmax"]]){
return(paste0("in"))}
else{
return(paste0("overlap"))}
}
}
))}
}
The function seems to work. However, it only works for i == 1 (rows of data2). Therefore, it correctly identifies time points from data 1 that fall into the first interval of each splitted element of data 2 but does not continue for other intervals.
Solutions I have tried:
- use ifelse instead of if statement
for (n in seq_along(data1)){
for (m in seq_along(data2)){
if (n == m){
data1[[n]][["target"]] = as.character(lapply(data1[[n]][["time"]], FUN = function(x){
for (i in 1:nrow(data2[[m]])){
ifelse((data2[[m]][["tmin"]]<=x & x<= data2[[m]][["tmax"]]), "in", "overlap")
}
}
))}}
}
However, this function returns NULL for each row of my new "target column".
- adding any() to my if statement:
for (n in seq_along(data_split)){
for (m in seq_along(data_split_target)){
if(n == m) {
data_split[[n]][["target"]] = as.character(lapply(data_split[[n]][["time"]], FUN = function(x){
for (i in 1:nrow(data_split_target[[m]])){
if(any(data_split_target[[m]][["tmin"]])<=x & any(x<= data_split_target[[m]][["tmax"]])){
return(paste0("in"))}
else{
return(paste0("overlap"))}
}
}
))}
}
Again, the function seems to work as it correctly creates a new "target" column with "in" and "overlap" rows but the function erroenously returns "in" row values even when the time point did not fall into one of the intervals.
Can someone help me? Many thanks!
CodePudding user response:
Here is a base R way using split/Map
.
The data sets are split and then Map
applies function f
to each sub-df.
meas_split <- split(measures, list(measures$Participant, measures$RecordingNumber))
int_split <- split(intervals, list(intervals$Participant, intervals$RecordingNumber))
nms <- intersect(names(meas_split), names(int_split))
i <- match(names(meas_split[nms]), names(int_split[nms]))
j <- match(names(int_split[nms]), names(meas_split[nms]))
f <- function(X, Y){
yes <- sapply(X[["time"]], \(x){
x > Y[["tmin"]] & x < Y[["tmax"]]
})
as.integer(colSums(yes) > 0)
}
measures$target <- unlist(Map(f, meas_split[i], int_split[j]))
CodePudding user response:
Solved it using sqldf package.
result_all = sqldf("select * from data1
left join data2
on data1.rec = data2.rec
and data1.time between data2.tmin and data2.tmax")
Where $rec is a grouping variable identifying both Participant and Recording Number.