So I have a data frame such as this, containing species names, IDs, markers, and size of DNA sequences:
Species | ID| marker| size
-----------------------------------------------------
Tilapia guineensis | 1| co1 | 400
Tilapia guineensis | 1| co2 | 300
Tilapia guineensis | 1| co2 | 700
Tilapia guineensis | 1| co2 | 900
Tilapia zillii | 2| co5 | 600
Tilapia zillii | 2| coi8| 200
Tilapia zillii | 2| coi8| 500
Eutrigla gurnardus | 5| co1 | 100
Eutrigla gurnardus | 5| co2 | 200
Sprattus sprattus | 6| co3 | 300
Sprattus sprattus | 6| co4 | 400
Sardinia pichardus | 7| co1 | 800
Sardinia pichardus | 7| co2 | 800
I would like to keep only one row for each ID, but I want to keep the row that has the highest value in the size column, regardless of the species and marker columns. If the rows have equal ID and equal size, I would like to just keep one randomly. My output would be:
Species | ID| marker| size
-----------------------------------------------------
Tilapia guineensis | 1| co2 | 900
Tilapia zillii | 2| co5 | 600
Eutrigla gurnardus | 5| co2 | 200
Sprattus sprattus | 6| co4 | 400
Sardinia pichardus | 7| co2 | 800
CodePudding user response:
A possible solution:
library(dplyr)
df %>%
group_by(ID) %>%
slice_max(size)
#> # A tibble: 4 × 4
#> # Groups: ID [4]
#> Species ID marker size
#> <chr> <int> <chr> <int>
#> 1 Tilapia guineensis 1 co2 900
#> 2 Tilapia zillii 2 co5 600
#> 3 Eutrigla gurnardus 5 co2 200
#> 4 Sprattus sprattus 6 co4 400
CodePudding user response:
data.table
option:
library(data.table)
setDT(df)[, .SD[which.max(size)], by=ID]