Home > Net >  Compare one row of a column against all others in group
Compare one row of a column against all others in group

Time:03-21

I am trying to calculate the number of days over which all objects in a group overlap with each member of the group. To do this I want to compare each row of one column in a group, to each other row in that column in the same group. However, I am unable to come up with a simple solution for this; most of my effort has been with the map variants from purrr. Aside from that I have gone down some nested loop (:-/), nested apply rabbit holes; but I suspect there is a very simple way to accomplish this comparison.

Essentially I want the sum of the intersect of each interval in a group to one row of the group.

Input data: (format with intervals)

ID Group year  interval_obs  
1   A   2020 2020-04-29 UTC--2020-05-19 UTC  
2   A   2020 2020-05-04 UTC--2020-05-29 UTC  
3   A   2020 2020-05-09 UTC--2020-05-24 UTC  
4   A   2020 2020-04-24 UTC--2020-04-28 UTC  
5   A   2020 2020-05-30 UTC--2020-06-03 UTC  
6   B   2020 2019-12-31 UTC--2020-01-20 UTC  
7   B   2020 2020-01-10 UTC--2020-01-30 UTC  
8   B   2020 2020-01-20 UTC--2020-02-09 UTC  
9   B   2020 2020-01-15 UTC--2020-02-04 UTC

Input data (more human readable?) - where each start/end is the Day of Year (doy)

ID Group Year start end
1   A   2020  120  140
2   A   2020  125  150
3   A   2020  130  145
4   A   2020  115  119
5   A   2020  151  155
6   B   2020    0   20
7   B   2020   10   30
8   B   2020   20   40
9   B   2020   15   35 

Desired Results:

ID  total_overlap  
  1   25  
  2   30  
  3   25  
  4    0  
  5    0  
  6   15  
  7   35  
  8   25  
  9   35  

note the desired total overlap is in days, the sum of all days the 4 other observations in group A overlap. Group B with 4 records to indicate variable lengths.

example data for problem

data <- structure(list(
  ID = 1:9,
  group = c("A", "A", "A", "A", "A", "B", "B", "B", "B"), 
  year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L,  2020L, 2020L, 2020L), 
  start = c(120L, 125L, 130L, 115L, 151L, 0L, 10L, 20L, 15L),
  end = c(140L, 150L, 145L, 119L, 155L, 20L,  30L, 40L, 35L)),
  class = "data.frame", 
  row.names = c(NA, -9L))

data <- data %>% 
  group_by(group, year) %>% # real dataset has several combos - both vars left as reminder
  mutate(across(c(start, end), ~ as_date(., origin = paste0(year-1, "-12-31")))) %>%  #this year-1 term is due to leap years etc.
  mutate(interval_obs = interval(ymd(start), ymd(end))) %>% 
  dplyr::select(-start, -end)

output <- data %>% map(.x = .$interval_obs, # this code at least runs
              .f = ~{results = sum(as.numeric(intersect(.x, .y$interval_obs)))})

The little chunk above is one of many types of way's I have approached this (map2, map_df etc.), and while it does not work I imagine (...) a solution is in that ballpark. Note that my example output has two features: 1) units are converted to days, 2) the 'self intersection' is subtracted out. Do not worry about those features I have ways to do both of those, I just did not include those because they may obfuscate the problem. However if it helps...

mutate(self_intersection = as.numeric(intersect(interval_obs, interval_obs2))) %>% 
mutate(results = results - self_intersection) %>% 
mutate(total_overlap = as.numeric(results)/86400)) 

I have been trying to keep data in lubridate or another date format so that different temporal resolutions could be easily accommodated in the future (e.g. hours, minutes)

edit 2 - example of calculating overlap for Group A

(data reproduced here)

ID Group Year start end
1   A   2020  120  140
2   A   2020  125  150
3   A   2020  130  145
4   A   2020  115  119
5   A   2020  151  155

for Group # 1, numbers after 'comparison' refer to ID.

comparison 1 - 2. End1 - Start2 = 15 days  
comparison 1 - 3. End1 - Start2 = 10 days  
comparison 1 - 4. NO OVERLAP    =  0 days  
comparison 1 - 5. NO OVERLAP    =  0 days  
total_overlap                     25 days  

CodePudding user response:

Is this what you are looking for?

The total overlap in the third line is off from your desired output, but that may be a typo?

library(tidyverse)
library(lubridate)

data |> 
  group_by(group) |> 
  mutate(total_overlap = map_dbl(interval_obs, 
                                 \(x) x |> 
                                   intersect(interval_obs) |> 
                                   int_length() |> 
                                   sum(na.rm = T) - int_length(x)
                                 ) / 86400
         )
#> # A tibble: 9 × 5
#> # Groups:   group [2]
#>      ID group  year interval_obs                   total_overlap
#>   <int> <chr> <int> <Interval>                             <dbl>
#> 1     1 A      2020 2020-04-29 UTC--2020-05-19 UTC            25
#> 2     2 A      2020 2020-05-04 UTC--2020-05-29 UTC            30
#> 3     3 A      2020 2020-05-09 UTC--2020-05-24 UTC            25
#> 4     4 A      2020 2020-04-24 UTC--2020-04-28 UTC             0
#> 5     5 A      2020 2020-05-30 UTC--2020-06-03 UTC             0
#> 6     6 B      2020 2019-12-31 UTC--2020-01-20 UTC            15
#> 7     7 B      2020 2020-01-10 UTC--2020-01-30 UTC            35
#> 8     8 B      2020 2020-01-20 UTC--2020-02-09 UTC            25
#> 9     9 B      2020 2020-01-15 UTC--2020-02-04 UTC            35
  • Related