I am trying to calculate the number of days over which all objects in a group overlap with each member of the group. To do this I want to compare each row of one column in a group, to each other row in that column in the same group. However, I am unable to come up with a simple solution for this; most of my effort has been with the map variants from purrr. Aside from that I have gone down some nested loop (:-/), nested apply rabbit holes; but I suspect there is a very simple way to accomplish this comparison.
Essentially I want the sum of the intersect of each interval in a group to one row of the group.
Input data: (format with intervals)
ID Group year interval_obs
1 A 2020 2020-04-29 UTC--2020-05-19 UTC
2 A 2020 2020-05-04 UTC--2020-05-29 UTC
3 A 2020 2020-05-09 UTC--2020-05-24 UTC
4 A 2020 2020-04-24 UTC--2020-04-28 UTC
5 A 2020 2020-05-30 UTC--2020-06-03 UTC
6 B 2020 2019-12-31 UTC--2020-01-20 UTC
7 B 2020 2020-01-10 UTC--2020-01-30 UTC
8 B 2020 2020-01-20 UTC--2020-02-09 UTC
9 B 2020 2020-01-15 UTC--2020-02-04 UTC
Input data (more human readable?) - where each start/end is the Day of Year (doy)
ID Group Year start end
1 A 2020 120 140
2 A 2020 125 150
3 A 2020 130 145
4 A 2020 115 119
5 A 2020 151 155
6 B 2020 0 20
7 B 2020 10 30
8 B 2020 20 40
9 B 2020 15 35
Desired Results:
ID total_overlap
1 25
2 30
3 25
4 0
5 0
6 15
7 35
8 25
9 35
note the desired total overlap is in days, the sum of all days the 4 other observations in group A overlap. Group B with 4 records to indicate variable lengths.
example data for problem
data <- structure(list(
ID = 1:9,
group = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
start = c(120L, 125L, 130L, 115L, 151L, 0L, 10L, 20L, 15L),
end = c(140L, 150L, 145L, 119L, 155L, 20L, 30L, 40L, 35L)),
class = "data.frame",
row.names = c(NA, -9L))
data <- data %>%
group_by(group, year) %>% # real dataset has several combos - both vars left as reminder
mutate(across(c(start, end), ~ as_date(., origin = paste0(year-1, "-12-31")))) %>% #this year-1 term is due to leap years etc.
mutate(interval_obs = interval(ymd(start), ymd(end))) %>%
dplyr::select(-start, -end)
output <- data %>% map(.x = .$interval_obs, # this code at least runs
.f = ~{results = sum(as.numeric(intersect(.x, .y$interval_obs)))})
The little chunk above is one of many types of way's I have approached this (map2, map_df etc.), and while it does not work I imagine (...) a solution is in that ballpark. Note that my example output has two features: 1) units are converted to days, 2) the 'self intersection' is subtracted out. Do not worry about those features I have ways to do both of those, I just did not include those because they may obfuscate the problem. However if it helps...
mutate(self_intersection = as.numeric(intersect(interval_obs, interval_obs2))) %>%
mutate(results = results - self_intersection) %>%
mutate(total_overlap = as.numeric(results)/86400))
I have been trying to keep data in lubridate or another date format so that different temporal resolutions could be easily accommodated in the future (e.g. hours, minutes)
edit 2 - example of calculating overlap for Group A
(data reproduced here)
ID Group Year start end
1 A 2020 120 140
2 A 2020 125 150
3 A 2020 130 145
4 A 2020 115 119
5 A 2020 151 155
for Group # 1, numbers after 'comparison' refer to ID.
comparison 1 - 2. End1 - Start2 = 15 days
comparison 1 - 3. End1 - Start2 = 10 days
comparison 1 - 4. NO OVERLAP = 0 days
comparison 1 - 5. NO OVERLAP = 0 days
total_overlap 25 days
CodePudding user response:
Is this what you are looking for?
The total overlap in the third line is off from your desired output, but that may be a typo?
library(tidyverse)
library(lubridate)
data |>
group_by(group) |>
mutate(total_overlap = map_dbl(interval_obs,
\(x) x |>
intersect(interval_obs) |>
int_length() |>
sum(na.rm = T) - int_length(x)
) / 86400
)
#> # A tibble: 9 × 5
#> # Groups: group [2]
#> ID group year interval_obs total_overlap
#> <int> <chr> <int> <Interval> <dbl>
#> 1 1 A 2020 2020-04-29 UTC--2020-05-19 UTC 25
#> 2 2 A 2020 2020-05-04 UTC--2020-05-29 UTC 30
#> 3 3 A 2020 2020-05-09 UTC--2020-05-24 UTC 25
#> 4 4 A 2020 2020-04-24 UTC--2020-04-28 UTC 0
#> 5 5 A 2020 2020-05-30 UTC--2020-06-03 UTC 0
#> 6 6 B 2020 2019-12-31 UTC--2020-01-20 UTC 15
#> 7 7 B 2020 2020-01-10 UTC--2020-01-30 UTC 35
#> 8 8 B 2020 2020-01-20 UTC--2020-02-09 UTC 25
#> 9 9 B 2020 2020-01-15 UTC--2020-02-04 UTC 35