I have a files with multiple duplicated pairs. I want to keep one of the duplicates based on a value in another column. In my example I want to keep the duplicate that has "Present" in the column "Detected".
structure(list(variable = c("CPCT02030264T", "DRUP01090004T",
"DRUP01160001T", "DRUP01090004T", "DRUP01160001T"), value = c(0,
0, 0, 22.2109790032886, 23.0078421452062), Detected = c("Absent",
"Absent", "Absent", "Present", "Present"), Method = c("DNA",
"DNA", "DNA", "RNA", "RNA")), class = "data.frame", row.names = c(39529L,
145909L, 146881L, 304365L, 305283L))
variable value Detected Method
39529 CPCT02030264T 0.00000 Absent DNA
145909 DRUP01090004T 0.00000 Absent DNA
146881 DRUP01160001T 0.00000 Absent DNA
304365 DRUP01090004T 22.21098 Present RNA
305283 DRUP01160001T 23.00784 Present RNA
out
variable value Detected Method
39529 CPCT02030264T 0.00000 Absent DNA
304365 DRUP01090004T 22.21098 Present RNA
305283 DRUP01160001T 23.00784 Present RNA
CodePudding user response:
You can use slice_max
by group to keep only the maximum value per group depending on the value of Detected
. In that sense the maximum of Detected
is the latest by alphabetical order:
library(dplyr)
df %>%
group_by(variable) %>%
slice_max(Detected)
# A tibble: 3 × 4
# Groups: variable [3]
variable value Detected Method
<chr> <dbl> <chr> <chr>
1 CPCT02030264T 0 Absent DNA
2 DRUP01090004T 22.2 Present RNA
3 DRUP01160001T 23.0 Present RNA
CodePudding user response:
dplyr::row_number() can rank the rows within each value of variable
. If you sort by Detected
descending, the "Present" values will appear on top of the "Absent". Then filter on the top row within each group (i.e., rn
is 1).
ds |>
dplyr::group_by(variable) |>
dplyr::mutate(
rn = dplyr::row_number(dplyr::desc(Detected)),
) |>
dplyr::filter(rn == 1L) |>
dplyr::ungroup()
Then I like executing dplyr::ungroup() so future operations apply to the entire dataset, instead of treating them as subsets.
Output:
# A tibble: 3 × 5
variable value Detected Method rn
<chr> <dbl> <chr> <chr> <int>
1 CPCT02030264T 0 Absent DNA 1
2 DRUP01090004T 22.2 Present RNA 1
3 DRUP01160001T 23.0 Present RNA 1