Home > Mobile >  How to remove row by range condition in a column using R
How to remove row by range condition in a column using R

Time:06-12

I have several data frames which has the following format:

sequence range sequence_ID
K142442 283-423 58654
K142442 283-414 58322
K142442 192-342 36762
K123123 771-250 21456
K123123 771-250 76846
K123123 771-250 41234
K343232 320-642 82657
K343232 320-642 36245
K343232 1670-1521 25264

(showing the relevant columns for time issues)

As you notice, there are some ranges (range column) that are repeated along the same sequences (sequence column) as shown in the table. So I'm trying to remove those redundant rows by looking the same range starting number from left to right.

In brief, if for example the sequence K142442 has 3 representatives as shown in the table, but two of these representatives have the same initial range (283 in the table) I only want to keep the sequence with the longer range of those sequences in the table.

This is the example data frame:

df = data_frame(sequence = c(rep("K142442",3),rep("K123123",3),
             rep("343232",3)),
range = c("283-423","283-414","192-342","771-250","771-250",
          "771-250","320-642","320-642","1670-1521"),
sequence_ID = c(58645,58322,36762,21456,76846,41232,82657,36246,25264))

desired output:

sequence range sequence_ID
K142442 283-423 58654
K142442 192-342 36762
K123123 771-250 21456
K343232 320-642 82657
K343232 1670-1521 25264

Thanks for your time.

CodePudding user response:

We could extract the minimum value from the 'range' column, use that as grouping column to get the first value

library(dplyr)
df %>%
  group_by(sequence, min_range = readr::parse_number(range)) %>% 
  summarise(across(c(range, sequence_ID), first), .groups = 'drop') %>%
  select(-min_range)%>% 
  arrange(match(sequence, df$sequence), match(range, df$range))

-output

# A tibble: 5 × 3
  sequence range     sequence_ID
  <chr>    <chr>           <dbl>
1 K142442  283-423         58645
2 K142442  192-342         36762
3 K123123  771-250         21456
4 343232   320-642         82657
5 343232   1670-1521       25264

Or with distinct

library(tidyr)
df %>% 
 separate(range, into = c('min_range', 'max_range'), remove = FALSE) %>% 
 distinct(sequence, min_range, .keep_all = TRUE) %>% 
 select(-min_range, -max_range)

If the ranges are not ordered, then we can also do

df %>% 
  separate(range, into = c('min_range', 'max_range'), 
      remove = FALSE, convert = TRUE) %>% 
  group_by(sequence, min_range) %>%
  slice_max(n =1, order_by = max_range, with_ties = FALSE) %>% 
  ungroup%>% 
  select(names(df)) %>% 
  arrange(match(sequence, df$sequence), match(range, df$range))

-output

# A tibble: 5 × 3
  sequence range     sequence_ID
  <chr>    <chr>           <dbl>
1 K142442  283-423         58645
2 K142442  192-342         36762
3 K123123  771-250         21456
4 343232   320-642         82657
5 343232   1670-1521       25264

CodePudding user response:

To add onto @akrun's answer, one could also do:

df |> 
  tidyr::separate(col = "range", into = c("start", "end")) |> 
  dplyr::group_by(sequence, start) |> 
  dplyr::summarise(
    end = max(end), 
    sequence_ID = sequence_ID[which.max(end)], 
    .groups = "drop"
  ) |> 
  tidyr::unite(col = "range", start:end, sep = "-", remove = TRUE)
#> # A tibble: 5 × 3
#>   sequence range     sequence_ID
#>   <chr>    <chr>           <dbl>
#> 1 K123123  771-250         21456
#> 2 K142442  192-342         36762
#> 3 K142442  283-423         58645
#> 4 K343232  1670-1521       25264
#> 5 K343232  320-642         82657

CodePudding user response:

In base R :

df[c("min" , "max")] <- do.call(rbind , strsplit(df$range , "-"))
df$min <- as.integer(df$min)
ag <- aggregate(max ~ sequence   min , df , max)

sequence_ID <- list()
j <- 1
for (i in ag$max){
  sequence_ID[[j]] <- df$sequence_ID[which(i == df$max)[1]]
  j <- j   1
}

cbind(ag , sequence_ID = unlist(sequence_ID))

  • Related