Choosing duplicates according to maximum (for multiple columns) in R-CodePudding

So I have a dataset with multiple duplicates, and I want to create a dataset that selects for the max value across multiple values. So for example:

  ID         Value1   Value2 Value3  Gender  Race
   1          45     76      87        M      B   
   1          34     45      95        M      B
   2          67     100     92        F      W
   2          43     70      89        F      W
   3          34     95      80        F      A
   3          22     41      90        F      A
   4          78     25       7        M      W
   4          32     37      13        M      W
   5          56     105     25        M      B
   5          80     59      45        M      B

Will become this:

  ID         Value1   Value2 Value3  Gender  Race
   1          45     76      95        M      B   
   2          67     100     92        F      W
   3          34     95      90        F      A
   4          78     56      13        M      W
   5          80     105     45        M      B

I have a feeling it has to do with the summarize command (although there are 40 value variables, so I fear writing a line of code for each variable) or some of the solutions provided here (which I don't know how to quite to modify for my needs): Remove duplicates keeping entry with largest absolute value

Any help appreciated!

CodePudding user response：

You can group by ID, Gender and Race and summarise the Value variables to get their max.

library(dplyr)

df %>%
  group_by(ID, Gender, Race) %>%
  summarise(across(starts_with('Value'), max, na.rm = TRUE), .groups = "drop")

#     ID Gender Race  Value1 Value2 Value3
#  <int> <chr>  <chr>  <int>  <int>  <int>
#1     1 M      B         45     76     95
#2     2 F      W         67    100     92
#3     3 F      A         34     95     90
#4     4 M      W         78     37     13
#5     5 M      B         80    105     45

CodePudding user response：

You can use aggregate function as follows ,

df <- data.frame(ID = c(1,1,2,2,3,3,4,4,5,5) , 
                 Value1 = c(45,34,67,43,34,22,78,32,56,80) , 
                 Value2 = c(76,45,100,70,95,41,25,37,105,59) ,
                 Value3 = c(87,95,92,89,80,90,7,13,25,45) ,
                 Gender = c("M","M","F","F","F","F","M","M","M","M") ,
                 Race = c("B","B","W","W","A","A","W","W","B","B"))

aggregate(df , by = list(df$ID) , max)
#>   Group.1 ID Value1 Value2 Value3 Gender Race
#> 1       1  1     45     76     95      M    B
#> 2       2  2     67    100     92      F    W
#> 3       3  3     34     95     90      F    A
#> 4       4  4     78     37     13      M    W
#> 5       5  5     80    105     45      M    B

^{Created on 2022-05-30 by the reprex package (v2.0.1)}