Home > Back-end >  calculating and keeping in the dataset only mean values of specific rows in R
calculating and keeping in the dataset only mean values of specific rows in R

Time:10-01

i am working with a big dataset where I have some observations with more measures and I would like to change the dataset in the way that for the observations with more values I keep only the mean.

I am adding a very small replicable example to show what I would like to do:

df <- data.frame(Name = c("Jon", "Bill", "Maria", "Ben", "Ben", "Ben", "Tina", "Joe", "Joe", "Joe"), n = c(23, 41, 32, 58, 26, 32, 76,51,20,34))

Here, I would like to keep the same data frame but instead of having three values for Ben and Joe, I would like to keep only the mean value of the three observations... this is a small example but I have a big data frame, could it be done with a function that can be applied to a bigger dataset? Thanks a lot!

CodePudding user response:

You could first get the mean values of Ben and Joe using group_by and summarise and after that add the remaining names using bind_rows like this:

df <- data.frame(Name = c("Jon", "Bill", "Maria", "Ben", "Ben", "Ben", "Tina", "Joe", "Joe", "Joe"), n = c(23, 41, 32, 58, 26, 32, 76,51,20,34))

library(dplyr)
df %>%
  filter(Name %in% c("Ben", "Joe"))  %>%
  group_by(Name) %>%
  summarise(n = mean(n)) %>%
  bind_rows(df %>% filter(!(Name %in% c("Ben", "Joe"))))
#> # A tibble: 6 × 2
#>   Name      n
#>   <chr> <dbl>
#> 1 Ben    38.7
#> 2 Joe    35  
#> 3 Jon    23  
#> 4 Bill   41  
#> 5 Maria  32  
#> 6 Tina   76

Created on 2022-09-30 with reprex v2.0.2

CodePudding user response:

You can use ifelse in the summarize call. I'm going to rename your original n variable to n_val to avoid confusion with the call the function n() which counts the number of rows in each group.

library(dplyr)

df <- data.frame(
  Name = c("Jon", "Bill", "Maria", "Ben", "Ben", "Ben", "Tina", "Joe", "Joe", "Joe"), 
  n_val = c(23, 41, 32, 58, 26, 32, 76,51,20,34)
)

df |> 
  group_by(Name) |> 
  summarize(n_val = ifelse(n() > 1, mean(n_val), n_val))

This allows you to adjust the threshold without knowing a priori which names to filter on.

  • Related