How to extract minimum and maximum values based on conditions in R-CodePudding

I have a data frame with thousands of rows and I need to output the minimum and maximum values of sections of data that belong to the same group and class. What I need is to read the first start value, compare it to the previous value in the end column and if smaller, jump to the next row and so on until the starting value is larger than the previous end value, then output the minimum starting value and the maximun for that section. My data is already ordered by group-class-start-end.

df <- data.frame(group = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
  class = c("2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3"),
  start = c("23477018","23535465","23567386","24708741","24708741","24708741","48339885","87274","87274","127819","1832772","1832772","1832772","6733569","7005524","7005524","7644572","8095433","8095433","8095433"),
  end = c("47341413", "47341413", "47909872","42247834","47776347","47909872","53818713","3161655","3479466","3503792","3503792","4916249","5329014","8089225","12037894","13934484","12037894","12037894","13626119","13934484"))

The output that I want to achieve is:

  group     class   start     end     
1   1       2    23477018   47909872
2   1       2    48339885   53818713
3   1       3    87274      5329014
4   1       3    6733569    13934484

Any ideas on how to achieve this will be very much appreciated.

CodePudding user response：

I used data.table for this.
My approach was to first change start and end to integers or there will be ordering problems.
Find which rows meet the start > max(all prior ends), then use cumsum to give an increasing sub-group number.
Then it's just a simple min and max by sub-group.
There are no loops to make this as fast as possible.

library(data.table)
df <- data.frame(group = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
                 class = c("2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3"),
                 start = c("23477018","23535465","23567386","24708741","24708741","24708741","48339885","87274","87274","127819","1832772","1832772","1832772","6733569","7005524","7005524","7644572","8095433","8095433","8095433"),
                 end = c("47341413", "47341413", "47909872","42247834","47776347","47909872","53818713","3161655","3479466","3503792","3503792","4916249","5329014","8089225","12037894","13934484","12037894","12037894","13626119","13934484"))

setDT(df)
df[, c('start', 'end') := lapply(.SD, as.integer), .SDcols = c('start', 'end')]
df[, subgrp := cumsum(start > shift(cummax(.SD$end), fill = 0)), keyby = c('group', 'class')]
ans <- df[, .(start = min(start), end = max(end)), keyby = c('group', 'class', 'subgrp')]
ans[, subgrp := NULL][]

   group class    start      end
1:     1     2 23477018 47909872
2:     1     2 48339885 53818713
3:     1     3    87274  5329014
4:     1     3  6733569 13934484

CodePudding user response：

Here is a tidyverse solution:

library(tidyverse)
            
df <- data.frame(
  group = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"),
  class = c("2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3"),
  start = c("23477018","23535465","23567386","24708741","24708741","24708741","48339885","87274","87274","127819","1832772","1832772","1832772","6733569","7005524","7005524","7644572","8095433","8095433","8095433"),
  end = c("47341413", "47341413", "47909872","42247834","47776347","47909872","53818713","3161655","3479466","3503792","3503792","4916249","5329014","8089225","12037894","13934484","12037894","12037894","13626119","13934484"))
df %>% 
  group_by(group, class) %>% 
  mutate(
    start = as.integer(start),
    end = as.integer(end),
    end_lag = lag(end),
    larger_flag = case_when(start > end_lag & !is.na(end_lag) ~ 1, TRUE ~ 0),
    sub_group = cumsum(larger_flag)) %>% 
  group_by(group, class, sub_group) %>% 
  summarise(
    start = min(start),
    end = max(end),
    .groups = 'drop'
    ) %>% 
  select(-sub_group)
 # A tibble: 4 x 4
   group class    start      max
   <chr> <chr>    <int>    <int>
 1 1     2     23477018 47909872
 2 1     2     48339885 53818713
 3 1     3        87274  5329014
 4 1     3      6733569 13934484