Keep specific rows among two duplicates-CodePudding

I have a files with multiple duplicated pairs. I want to keep one of the duplicates based on a value in another column. In my example I want to keep the duplicate that has "Present" in the column "Detected".

structure(list(variable = c("CPCT02030264T", "DRUP01090004T", 
"DRUP01160001T", "DRUP01090004T", "DRUP01160001T"), value = c(0, 
0, 0, 22.2109790032886, 23.0078421452062), Detected = c("Absent", 
"Absent", "Absent", "Present", "Present"), Method = c("DNA", 
"DNA", "DNA", "RNA", "RNA")), class = "data.frame", row.names = c(39529L, 
145909L, 146881L, 304365L, 305283L))

    variable    value Detected Method
39529  CPCT02030264T  0.00000   Absent    DNA
145909 DRUP01090004T  0.00000   Absent    DNA
146881 DRUP01160001T  0.00000   Absent    DNA
304365 DRUP01090004T 22.21098  Present    RNA
305283 DRUP01160001T 23.00784  Present    RNA

out

            variable    value Detected Method
39529  CPCT02030264T  0.00000   Absent    DNA
304365 DRUP01090004T 22.21098  Present    RNA
305283 DRUP01160001T 23.00784  Present    RNA

CodePudding user response：

You can use slice_max by group to keep only the maximum value per group depending on the value of Detected. In that sense the maximum of Detected is the latest by alphabetical order:

library(dplyr)
df %>% 
  group_by(variable) %>% 
  slice_max(Detected)

# A tibble: 3 × 4
# Groups:   variable [3]
  variable      value Detected Method
  <chr>         <dbl> <chr>    <chr> 
1 CPCT02030264T   0   Absent   DNA   
2 DRUP01090004T  22.2 Present  RNA   
3 DRUP01160001T  23.0 Present  RNA

CodePudding user response：

dplyr::row_number() can rank the rows within each value of variable. If you sort by Detected descending, the "Present" values will appear on top of the "Absent". Then filter on the top row within each group (i.e., rn is 1).

ds |> 
  dplyr::group_by(variable) |> 
  dplyr::mutate(
    rn = dplyr::row_number(dplyr::desc(Detected)),
  ) |> 
  dplyr::filter(rn == 1L) |> 
  dplyr::ungroup()

Then I like executing dplyr::ungroup() so future operations apply to the entire dataset, instead of treating them as subsets.

Output:

# A tibble: 3 × 5
  variable      value Detected Method    rn
  <chr>         <dbl> <chr>    <chr>  <int>
1 CPCT02030264T   0   Absent   DNA        1
2 DRUP01090004T  22.2 Present  RNA        1
3 DRUP01160001T  23.0 Present  RNA        1