i am working with a big dataset where I have some observations with more measures and I would like to change the dataset in the way that for the observations with more values I keep only the mean.
I am adding a very small replicable example to show what I would like to do:
df <- data.frame(Name = c("Jon", "Bill", "Maria", "Ben", "Ben", "Ben", "Tina", "Joe", "Joe", "Joe"), n = c(23, 41, 32, 58, 26, 32, 76,51,20,34))
Here, I would like to keep the same data frame but instead of having three values for Ben and Joe, I would like to keep only the mean value of the three observations... this is a small example but I have a big data frame, could it be done with a function that can be applied to a bigger dataset? Thanks a lot!
CodePudding user response:
You could first get the mean values of Ben and Joe using group_by
and summarise
and after that add the remaining names using bind_rows
like this:
df <- data.frame(Name = c("Jon", "Bill", "Maria", "Ben", "Ben", "Ben", "Tina", "Joe", "Joe", "Joe"), n = c(23, 41, 32, 58, 26, 32, 76,51,20,34))
library(dplyr)
df %>%
filter(Name %in% c("Ben", "Joe")) %>%
group_by(Name) %>%
summarise(n = mean(n)) %>%
bind_rows(df %>% filter(!(Name %in% c("Ben", "Joe"))))
#> # A tibble: 6 × 2
#> Name n
#> <chr> <dbl>
#> 1 Ben 38.7
#> 2 Joe 35
#> 3 Jon 23
#> 4 Bill 41
#> 5 Maria 32
#> 6 Tina 76
Created on 2022-09-30 with reprex v2.0.2
CodePudding user response:
You can use ifelse
in the summarize
call. I'm going to rename your original n
variable to n_val
to avoid confusion with the call the function n()
which counts the number of rows in each group.
library(dplyr)
df <- data.frame(
Name = c("Jon", "Bill", "Maria", "Ben", "Ben", "Ben", "Tina", "Joe", "Joe", "Joe"),
n_val = c(23, 41, 32, 58, 26, 32, 76,51,20,34)
)
df |>
group_by(Name) |>
summarize(n_val = ifelse(n() > 1, mean(n_val), n_val))
This allows you to adjust the threshold without knowing a priori which names to filter on.