Identify intervals where a given vector of dates occurs-CodePudding

it's my first time posting here! I am super stuck on what I'm sure is an easy thing to do.

I have a dataframe of irregular intervals and a vector of dates. If one of the dates occurs within any of the given intervals, I would like a new column to flag this (as the intervals need to be deleted). Similar to this post but the solution doesn't work due to the irregular intervals. I have over 2000 intervals and 2000 dates.

I can get the dates that occur within the given intervals using the %within% function, but this is no good as I cant find which intervals the dates are occurring in.

I've tried the solutions in this similar post but I have no grouping variable and can't get them to work.

Any suggestions would be incredibly helpful!!! Thank you so much!!

Example raw data (not as lubridate intervals):

>df1
 diveno               start                 fin
1      1 2018-08-01 08:20:40 2018-08-01 08:39:20
2      2 2018-08-01 08:40:50 2018-08-01 08:53:40
3      3 2018-08-01 10:01:00 2018-08-01 10:16:30
4      4 2018-08-01 15:45:30 2018-08-01 15:58:20
5      5 2018-08-01 17:06:00 2018-08-01 17:18:20

>df2
                 date
1 2018-08-01 08:30:00
2 2018-08-01 15:47:00
3 2018-08-02 17:10:00

What I'd like
> df3
  diveno               start                 fin dateoccurs
1      1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
2      2 2018-08-01 08:40:50 2018-08-01 08:53:40          N
3      3 2018-08-01 10:01:00 2018-08-01 10:16:30          N
4      4 2018-08-01 15:45:30 2018-08-01 15:58:20          Y
5      5 2018-08-01 17:06:00 2018-08-01 17:18:20          N

Where the dateoccurs column flags if a date from df2 occurs in given intervals in df1

Code for example data:


df1<-data.frame(diveno=c(1,2,3,4,5), 
           start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
           fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"))

df1$start <- as.POSIXct(df1$start,format="%Y-%m-%d %H:%M:%S",tz="CET")
df1$fin <- as.POSIXct(df1$fin,format="%Y-%m-%d %H:%M:%S",tz="CET")


df2<-data.frame(date=c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date,format="%Y-%m-%d %H:%M:%S",tz="CET")

What I need:


df3<-data.frame(diveno=c(1,2,3,4,5), 
                start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
                fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"),
                dateoccurs=c("Y","N","N","Y","N"))

The closest I've gotten is using an answer from this post But it returns altered 'fin' times, and when applied to the real massive dataset seems to duplicate values and change the number of 'diveno'!

intervals<-df1
elements<-df2[,1]

library(data.table) #v1.10.0
j<-setDT(intervals)[data.table(elements), on = .(start <= elements, fin >= elements)]
j2<-as.data.frame(j)
na.omit(j2)

UPDATED sample data for df2 that seems to produce false positives?

> dput(df2) structure(list(date = structure(c(1533096000, 1533096300, 1533096600,  1533096900, 1533097200, 1533097500, 1533097800, 1533098100, 1533098400,  1533098700, 1533099000, 1533099300, 1533099600, 1533099900, 1533100200,  1533100500, 1533100800, 1533101100, 1533101400, 1533101700, 1533102000,  1533102300, 1533102600, 1533102900, 1533103200, 1533103500, 1533103800,  1533104100, 1533104400, 1533104700, 1533105000, 1533105300, 1533105600,  1533105900, 1533106200, 1533106500, 1533106800, 1533107100, 1533107400,  1533107700, 1533108000, 1533108300, 1533108600, 1533108900, 1533109200,  1533109500, 1533109800, 1533110100), tzone = "UTC", class = c("POSIXct",  "POSIXt")), depth = c(NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_)), class = "data.frame", row.names = c(NA, 
-48L))

Update 2 (sorry!) within df2 the following times:

30 2018-08-01 06:25:00    NA
31 2018-08-01 06:30:00    NA
32 2018-08-01 06:35:00    NA

Seem to be recognised as falling within the following intervals:

   diveno               start                 fin dateoccurs
1        1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
2        2 2018-08-01 08:40:50 2018-08-01 08:53:40          Y
3        3 2018-08-01 10:01:00 2018-08-01 10:16:30          N

Why might this be happening?

s<-df1[1,2] f<-df1[1,3]     int<-interval(s,f)

df2[,1] %within% ir

CodePudding user response：

You may use outer.

fun <- function(i, j) data.table::between(df2[i, 'date'], df1[j, 'start'], df1[j, 'fin'])
df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun)))
df1
#   diveno               start                 fin occ
# 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20   1
# 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40   0
# 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30   0
# 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20   1
# 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20   0

The binary column can easily be wrapped as factor if you like.

df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun))) |> 
  factor(labels=c("N", "Y"))
df1
#   diveno               start                 fin occ
# 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20   Y
# 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40   N
# 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30   N
# 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20   Y
# 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20   N

Data:

df1 <- structure(list(diveno = c(1, 2, 3, 4, 5), start = structure(c(1533104440, 
1533105650, 1533110460, 1533131130, 1533135960), class = c("POSIXct", 
"POSIXt"), tzone = "CET"), fin = structure(c(1533105560, 1533106420, 
1533111390, 1533131900, 1533136700), class = c("POSIXct", "POSIXt"
), tzone = "CET"), occ = structure(c(2L, 1L, 1L, 2L, 1L), levels = c("N", 
"Y"), class = "factor")), row.names = c(NA, -5L), class = "data.frame")

df2 <- structure(list(date = structure(c(1533105000, 1533131220, 1533222600
), class = c("POSIXct", "POSIXt"), tzone = "CET")), row.names = c(NA, 
-3L), class = "data.frame")

CodePudding user response：

data.table solution using a non-equi join

library(data.table)
setDT(df1); setDT(df2)
# initialise new column with "N"
df1[, dateoccurs := "N"]
# update join
df1[df2, dateoccurs := "Y", on = .(start <= date, fin >= date)][]
#    diveno               start                 fin dateoccurs
# 1:      1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
# 2:      2 2018-08-01 08:40:50 2018-08-01 08:53:40          N
# 3:      3 2018-08-01 10:01:00 2018-08-01 10:16:30          N
# 4:      4 2018-08-01 15:45:30 2018-08-01 15:58:20          Y
# 5:      5 2018-08-01 17:06:00 2018-08-01 17:18:20          N