I have several data frames which has the following format:
sequence | range | sequence_ID |
---|---|---|
K142442 | 283-423 | 58654 |
K142442 | 283-414 | 58322 |
K142442 | 192-342 | 36762 |
K123123 | 771-250 | 21456 |
K123123 | 771-250 | 76846 |
K123123 | 771-250 | 41234 |
K343232 | 320-642 | 82657 |
K343232 | 320-642 | 36245 |
K343232 | 1670-1521 | 25264 |
(showing the relevant columns for time issues)
As you notice, there are some ranges (range column) that are repeated along the same sequences (sequence column) as shown in the table. So I'm trying to remove those redundant rows by looking the same range starting number from left to right.
In brief, if for example the sequence K142442 has 3 representatives as shown in the table, but two of these representatives have the same initial range (283 in the table) I only want to keep the sequence with the longer range of those sequences in the table.
This is the example data frame:
df = data_frame(sequence = c(rep("K142442",3),rep("K123123",3),
rep("343232",3)),
range = c("283-423","283-414","192-342","771-250","771-250",
"771-250","320-642","320-642","1670-1521"),
sequence_ID = c(58645,58322,36762,21456,76846,41232,82657,36246,25264))
desired output:
sequence | range | sequence_ID |
---|---|---|
K142442 | 283-423 | 58654 |
K142442 | 192-342 | 36762 |
K123123 | 771-250 | 21456 |
K343232 | 320-642 | 82657 |
K343232 | 1670-1521 | 25264 |
Thanks for your time.
CodePudding user response:
We could extract the minimum value from the 'range' column, use that as grouping column to get the first
value
library(dplyr)
df %>%
group_by(sequence, min_range = readr::parse_number(range)) %>%
summarise(across(c(range, sequence_ID), first), .groups = 'drop') %>%
select(-min_range)%>%
arrange(match(sequence, df$sequence), match(range, df$range))
-output
# A tibble: 5 × 3
sequence range sequence_ID
<chr> <chr> <dbl>
1 K142442 283-423 58645
2 K142442 192-342 36762
3 K123123 771-250 21456
4 343232 320-642 82657
5 343232 1670-1521 25264
Or with distinct
library(tidyr)
df %>%
separate(range, into = c('min_range', 'max_range'), remove = FALSE) %>%
distinct(sequence, min_range, .keep_all = TRUE) %>%
select(-min_range, -max_range)
If the ranges are not ordered, then we can also do
df %>%
separate(range, into = c('min_range', 'max_range'),
remove = FALSE, convert = TRUE) %>%
group_by(sequence, min_range) %>%
slice_max(n =1, order_by = max_range, with_ties = FALSE) %>%
ungroup%>%
select(names(df)) %>%
arrange(match(sequence, df$sequence), match(range, df$range))
-output
# A tibble: 5 × 3
sequence range sequence_ID
<chr> <chr> <dbl>
1 K142442 283-423 58645
2 K142442 192-342 36762
3 K123123 771-250 21456
4 343232 320-642 82657
5 343232 1670-1521 25264
CodePudding user response:
To add onto @akrun's answer, one could also do:
df |>
tidyr::separate(col = "range", into = c("start", "end")) |>
dplyr::group_by(sequence, start) |>
dplyr::summarise(
end = max(end),
sequence_ID = sequence_ID[which.max(end)],
.groups = "drop"
) |>
tidyr::unite(col = "range", start:end, sep = "-", remove = TRUE)
#> # A tibble: 5 × 3
#> sequence range sequence_ID
#> <chr> <chr> <dbl>
#> 1 K123123 771-250 21456
#> 2 K142442 192-342 36762
#> 3 K142442 283-423 58645
#> 4 K343232 1670-1521 25264
#> 5 K343232 320-642 82657
CodePudding user response:
In base R
:
df[c("min" , "max")] <- do.call(rbind , strsplit(df$range , "-"))
df$min <- as.integer(df$min)
ag <- aggregate(max ~ sequence min , df , max)
sequence_ID <- list()
j <- 1
for (i in ag$max){
sequence_ID[[j]] <- df$sequence_ID[which(i == df$max)[1]]
j <- j 1
}
cbind(ag , sequence_ID = unlist(sequence_ID))