I am trying to remove rows from a data frame based on multiple conditions from different columns.
My data frame looks something like this:
# reproducible example
df <- data.frame (id = c("id1", "id1", "id1", "id1", "id1", "id2", "id2", "id2", "id2", "id2", "id2", "id2", "id2"),
key = c("a", "a", "a", "b", "b", "a", "a", "a", "a", "a", "a", "a"),
start = c(161, 161, 162, 353, 354, 823, 823, 824, 824, 1170, 1170, 1172),
end = c(205, 228, 166, 408, 408, 831, 837, 829, 830, "1176", "1194", "1194"))
id key start end
id1 a 161 205
id1 a 161 228
id1 a 162 166
id1 b 353 408
id1 b 354 408
id2 a 823 831
id2 a 823 837
id2 a 824 829
id2 a 824 830
id2 a 1170 1176
id2 a 1170 1194
id2 a 1172 1194
I would like to keep the rows with the same id
and key
, but with the condition that the start
value is greater or equal (>=) and/or the end
value is less or equal (<=) compared to the duplicates.
So, at the end I would like to get one row per id
and key
with the longest range: minimum value for start
and maximum value for end
.
Thus, the final data frame would look like this:
id key start end
id1 a 161 228
id1 b 353 408
id2 a 823 837
id2 a 1170 1194
Thanks!
CodePudding user response:
We could get the min/max
after grouping by 'id' and 'key'. If there are multiple ranges, we may need to create another grouping based on the difference in the previous end and the the 'start' value
library(dplyr)
df %>%
group_by(id, key) %>%
mutate(grp = cumsum(lag(end, default = first(end)) - start < 0)) %>%
group_by(grp, .add = TRUE) %>%
summarise(start = min(start), end = max(end), .groups = 'drop') %>%
ungroup %>%
select(-grp)
-output
# A tibble: 4 × 4
id key start end
<chr> <chr> <dbl> <dbl>
1 id1 a 161 228
2 id1 b 353 408
3 id2 a 823 837
4 id2 a 1170 1194