Home > Back-end >  Choosing duplicates according to maximum (for multiple columns) in R
Choosing duplicates according to maximum (for multiple columns) in R

Time:05-30

So I have a dataset with multiple duplicates, and I want to create a dataset that selects for the max value across multiple values. So for example:

  ID         Value1   Value2 Value3  Gender  Race
   1          45     76      87        M      B   
   1          34     45      95        M      B
   2          67     100     92        F      W
   2          43     70      89        F      W
   3          34     95      80        F      A
   3          22     41      90        F      A
   4          78     25       7        M      W
   4          32     37      13        M      W
   5          56     105     25        M      B
   5          80     59      45        M      B

Will become this:

  ID         Value1   Value2 Value3  Gender  Race
   1          45     76      95        M      B   
   2          67     100     92        F      W
   3          34     95      90        F      A
   4          78     56      13        M      W
   5          80     105     45        M      B

I have a feeling it has to do with the summarize command (although there are 40 value variables, so I fear writing a line of code for each variable) or some of the solutions provided here (which I don't know how to quite to modify for my needs): Remove duplicates keeping entry with largest absolute value

Any help appreciated!

CodePudding user response:

You can group by ID, Gender and Race and summarise the Value variables to get their max.

library(dplyr)

df %>%
  group_by(ID, Gender, Race) %>%
  summarise(across(starts_with('Value'), max, na.rm = TRUE), .groups = "drop")

#     ID Gender Race  Value1 Value2 Value3
#  <int> <chr>  <chr>  <int>  <int>  <int>
#1     1 M      B         45     76     95
#2     2 F      W         67    100     92
#3     3 F      A         34     95     90
#4     4 M      W         78     37     13
#5     5 M      B         80    105     45

CodePudding user response:

You can use aggregate function as follows ,

df <- data.frame(ID = c(1,1,2,2,3,3,4,4,5,5) , 
                 Value1 = c(45,34,67,43,34,22,78,32,56,80) , 
                 Value2 = c(76,45,100,70,95,41,25,37,105,59) ,
                 Value3 = c(87,95,92,89,80,90,7,13,25,45) ,
                 Gender = c("M","M","F","F","F","F","M","M","M","M") ,
                 Race = c("B","B","W","W","A","A","W","W","B","B"))

aggregate(df , by = list(df$ID) , max)
#>   Group.1 ID Value1 Value2 Value3 Gender Race
#> 1       1  1     45     76     95      M    B
#> 2       2  2     67    100     92      F    W
#> 3       3  3     34     95     90      F    A
#> 4       4  4     78     37     13      M    W
#> 5       5  5     80    105     45      M    B

Created on 2022-05-30 by the reprex package (v2.0.1)

  • Related