So I have a dataset with multiple duplicates, and I want to create a dataset that selects for the max value across multiple values. So for example:
ID Value1 Value2 Value3 Gender Race
1 45 76 87 M B
1 34 45 95 M B
2 67 100 92 F W
2 43 70 89 F W
3 34 95 80 F A
3 22 41 90 F A
4 78 25 7 M W
4 32 37 13 M W
5 56 105 25 M B
5 80 59 45 M B
Will become this:
ID Value1 Value2 Value3 Gender Race
1 45 76 95 M B
2 67 100 92 F W
3 34 95 90 F A
4 78 56 13 M W
5 80 105 45 M B
I have a feeling it has to do with the summarize command (although there are 40 value variables, so I fear writing a line of code for each variable) or some of the solutions provided here (which I don't know how to quite to modify for my needs): Remove duplicates keeping entry with largest absolute value
Any help appreciated!
CodePudding user response:
You can group by ID
, Gender
and Race
and summarise the Value
variables to get their max.
library(dplyr)
df %>%
group_by(ID, Gender, Race) %>%
summarise(across(starts_with('Value'), max, na.rm = TRUE), .groups = "drop")
# ID Gender Race Value1 Value2 Value3
# <int> <chr> <chr> <int> <int> <int>
#1 1 M B 45 76 95
#2 2 F W 67 100 92
#3 3 F A 34 95 90
#4 4 M W 78 37 13
#5 5 M B 80 105 45
CodePudding user response:
You can use aggregate
function as follows ,
df <- data.frame(ID = c(1,1,2,2,3,3,4,4,5,5) ,
Value1 = c(45,34,67,43,34,22,78,32,56,80) ,
Value2 = c(76,45,100,70,95,41,25,37,105,59) ,
Value3 = c(87,95,92,89,80,90,7,13,25,45) ,
Gender = c("M","M","F","F","F","F","M","M","M","M") ,
Race = c("B","B","W","W","A","A","W","W","B","B"))
aggregate(df , by = list(df$ID) , max)
#> Group.1 ID Value1 Value2 Value3 Gender Race
#> 1 1 1 45 76 95 M B
#> 2 2 2 67 100 92 F W
#> 3 3 3 34 95 90 F A
#> 4 4 4 78 37 13 M W
#> 5 5 5 80 105 45 M B
Created on 2022-05-30 by the reprex package (v2.0.1)