Home > front end >  Remove rows based on multiple column conditions in R (duplicates and min/max values)
Remove rows based on multiple column conditions in R (duplicates and min/max values)

Time:09-08

I am trying to remove rows from a data frame based on multiple conditions from different columns.

My data frame looks something like this:

# reproducible example
df <- data.frame (id  = c("id1", "id1", "id1", "id1", "id1", "id2", "id2", "id2", "id2", "id2", "id2", "id2", "id2"),
                  key = c("a", "a", "a", "b", "b", "a", "a", "a", "a", "a", "a", "a"),
                  start = c(161, 161, 162, 353, 354, 823, 823, 824, 824, 1170, 1170, 1172),
                  end = c(205, 228, 166, 408, 408, 831, 837, 829, 830, "1176", "1194", "1194"))
id   key   start   end
id1  a     161     205
id1  a     161     228
id1  a     162     166
id1  b     353     408
id1  b     354     408
id2  a     823     831
id2  a     823     837
id2  a     824     829
id2  a     824     830
id2  a     1170    1176
id2  a     1170    1194
id2  a     1172    1194

I would like to keep the rows with the same id and key, but with the condition that the start value is greater or equal (>=) and/or the end value is less or equal (<=) compared to the duplicates.

So, at the end I would like to get one row per id and key with the longest range: minimum value for start and maximum value for end.

Thus, the final data frame would look like this:

id   key   start   end
id1  a     161     228
id1  b     353     408
id2  a     823     837
id2  a     1170    1194

Thanks!

CodePudding user response:

We could get the min/max after grouping by 'id' and 'key'. If there are multiple ranges, we may need to create another grouping based on the difference in the previous end and the the 'start' value

library(dplyr)
df %>%
   group_by(id, key) %>%
   mutate(grp = cumsum(lag(end, default = first(end)) - start < 0)) %>% 
   group_by(grp, .add = TRUE) %>% 
   summarise(start = min(start), end = max(end), .groups = 'drop') %>% 
   ungroup %>% 
   select(-grp)

-output

# A tibble: 4 × 4
  id    key   start   end
  <chr> <chr> <dbl> <dbl>
1 id1   a       161   228
2 id1   b       353   408
3 id2   a       823   837
4 id2   a      1170  1194
  • Related