I have a data frame like this one:
df2 <- data.frame(
  chr     = c("Chr1", "Chr1", "Chr1", "Chr1", "Chr1"),
  start   = c(303259, 303259, 141256011, 143116722, 141256011),
  end     = c(11385251, 10779165, 141618035, 156328057, 156328057),
  chr.2   = c("Chr1", "Chr1", "Chr1", "Chr1", "Chr1"),
  start.2 = c(303259, 303259, 141256011, 141256011, 143116722),
  end.2   = c(10779165, 11385251, 156328057, 156328057, 156328057)
)
The table looks like this:
   chr     start       end chr.2   start.2     end.2
1 Chr1    303259  11385251  Chr1    303259  10779165
2 Chr1    303259  10779165  Chr1    303259  11385251
3 Chr1 141256011 141618035  Chr1 141256011 156328057
4 Chr1 143116722 156328057  Chr1 141256011 156328057
5 Chr1 141256011 156328057  Chr1 143116722 156328057
As you can see, rows 1 and 2 in this example contain the same pair of intervals, only in reverse order, and I would like to keep just one of them. The same is true of rows 4 and 5. If there happen to be any exactly duplicated rows, I would like to remove those too.
I would like to obtain something like this:
   chr     start       end chr.2   start.2     end.2
1 Chr1    303259  11385251  Chr1    303259  10779165
3 Chr1 141256011 141618035  Chr1 141256011 156328057
4 Chr1 143116722 156328057  Chr1 141256011 156328057
Do you know how I could achieve this?
CodePudding user response:
Use purrr::map2() to create list-columns containing the sorted “starts” and “ends” for each row, then use dplyr::distinct() to remove duplicates:
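library(purrr)
library(dplyr)

df2 %>%
  mutate(
    # one sorted length-2 vector per row, so the order of the two intervals no longer matters
    starts = map2(start, start.2, ~ sort(c(.x, .y))),
    ends = map2(end, end.2, ~ sort(c(.x, .y)))
  ) %>%
  # keep the first row for each distinct (starts, ends) combination
  distinct(starts, ends, .keep_all = TRUE) %>%
  # drop the helper columns
  select(!starts:ends)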
   chr     start       end chr.2   start.2     end.2
1 Chr1    303259  11385251  Chr1    303259  10779165
2 Chr1 141256011 141618035  Chr1 141256011 156328057
3 Chr1 143116722 156328057  Chr1 141256011 156328057
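
One caveat: the approach above sorts the starts and the ends independently and does not look at the chr columns. On the example data that is fine, but two rows could in principle share the same sorted starts and ends without containing the same pair of intervals, or differ only in chromosome. If that matters for your data, a base R sketch along these lines may be safer. It assumes that each interval is fully identified by its chromosome, start and end; the helper names a, b, first, second and key are only for illustration.

```r
df2 <- data.frame(
  chr     = c("Chr1", "Chr1", "Chr1", "Chr1", "Chr1"),
  start   = c(303259, 303259, 141256011, 143116722, 141256011),
  end     = c(11385251, 10779165, 141618035, 156328057, 156328057),
  chr.2   = c("Chr1", "Chr1", "Chr1", "Chr1", "Chr1"),
  start.2 = c(303259, 303259, 141256011, 141256011, 143116722),
  end.2   = c(10779165, 11385251, 156328057, 156328057, 156328057)
)

# One string per interval, e.g. "Chr1:303259-11385251"
a <- paste0(df2$chr,   ":", df2$start,   "-", df2$end)
b <- paste0(df2$chr.2, ":", df2$start.2, "-", df2$end.2)

# Put the two interval strings of each row in a fixed order,
# so that (A, B) and (B, A) produce the same key
first  <- ifelse(a <= b, a, b)
second <- ifelse(a <= b, b, a)
key <- paste(first, second, sep = "|")

# Keep only the first row seen for each key; exact duplicates are dropped too
result <- df2[!duplicated(key), ]
result
```

On the example data this keeps rows 1, 3 and 4, with their original row names, which matches the output asked for in the question.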