I have the following situation:
ID VAR HSP VAR2 CODE
00001 56 B 345 1
00001 52 B 306 1
00002 5 C 32 2
00002 22 A 31 2
00003 55 D 23 1
00003 5 E 87 1
What I need to do is to: for replicated IDs perform the mean of VAR and VAR2 (i.e. all the numeric variables. I have many other numeric columns). The mean should not be done for the column code. Moreover since I have different types of HSP for the final data frame I need to keep the first as it appears in the sorting. The same for the column code.
Desired output:
ID VAR HSP VAR2 CODE
00001 54 B 325.5 1
00002 13.5 C 31.5 2
00003 30 D 55 1
CodePudding user response:
For this specific case, you can summarise the IDs like this.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(VAR = mean(VAR), HSP = first(HSP), VAR2 = mean(VAR2), CODE = first(CODE))
For a more generalized solution, here's how you get the mean for all numerical columns and the first value for all non-numerical values.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(where(~ is.numeric(.x)), mean),
across(where(~ !is.numeric(.x)), first))