Home > database >  Create mean column for specific columns depending on group in R
Create mean column for specific columns depending on group in R

Time:06-09

I want to create a new column containg means of specific columns. The selected columns should depend on the group.

group first second third
0 3 2 4
0 0 NA 5
0 2 7 1
1 3 1 6
1 4 0 NA
1 2 3 3
0 5 5 0
0 6 2 2
1 NA 1 3

As an example: I want a mean column with the following conditions:

  • if a row contains a "0" in group, the mean should be calculated from "first" and "second"
  • if a row contains a "1" in group, the mean should be calculated from "first" and "third"
  • if a cell contains NA it should be ignored

So the final dataframe should look something like this:

group first second third mean
0 3 2 4 2.5
0 0 NA 5 0
0 2 7 1 4.5
1 3 1 6 4.5
1 4 0 NA 4
1 2 3 3 2.5
0 5 5 0 5
0 6 2 2 4
1 NA 1 3 3

Since my dataframe contains over 50 variables (and a few thousand rows) and not just those I want the mean from I can't select specific columns by their column or row number (like c(2,5),). I was thinking about adding a condition that explains to R that it should calculate the mean from "first" and "second" only for those rows that have a "0" in group and then the same principle for group = 1. I have no idea how to combinde these conditions or how I can do this in several steps.

CodePudding user response:

library(tidyverse)

tribble(
  ~group, ~first, ~second, ~third,
  0, 3, 2, 4,
  0, 0, NA, 5,
  0, 2, 7, 1,
  1, 3, 1, 6,
  1, 4, 0, NA,
  1, 2, 3, 3,
  0, 5, 5, 0,
  0, 6, 2, 2,
  1, NA, 1, 3
) |> 
  rowwise() |> 
  mutate(mean = if_else(group == 0, mean(c_across(c(first, second)), na.rm = TRUE), 
                        mean(c_across(c(first, third)), na.rm = TRUE)))

#> # A tibble: 9 × 5
#> # Rowwise: 
#>   group first second third  mean
#>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0     3      2     4   2.5
#> 2     0     0     NA     5   0  
#> 3     0     2      7     1   4.5
#> 4     1     3      1     6   4.5
#> 5     1     4      0    NA   4  
#> 6     1     2      3     3   2.5
#> 7     0     5      5     0   5  
#> 8     0     6      2     2   4  
#> 9     1    NA      1     3   3

Created on 2022-06-08 by the reprex package (v2.0.1)

CodePudding user response:

One way to do this would be to pivot the data to long format and use case_when() to add a weight variable of 0 (for values you want ignored) and 1 (for values you want included) according to your conditions. Use weighted.mean() to calculate your mean and pivot back to wide.

library(tidyr)
library(dplyr)

df %>%
  rowid_to_column() %>%
  pivot_longer(-c(rowid, group)) %>%
  mutate(weight = case_when(group == 0 & name == "third" ~ 0,
                            group == 1 & name == "second" ~ 0,
                            TRUE ~ 1)) %>%
  group_by(rowid) %>%
  mutate(mean = weighted.mean(value, weight, na.rm = TRUE)) %>%
  pivot_wider(-weight) %>%
  ungroup() %>%
  relocate(mean, .after = last_col())

# A tibble: 9 × 6
  rowid group first second third  mean
  <int> <dbl> <dbl>  <dbl> <dbl> <dbl>
1     1     0     3      2     4   2.5
2     2     0     0     NA     5   0  
3     3     0     2      7     1   4.5
4     4     1     3      1     6   4.5
5     5     1     4      0    NA   4  
6     6     1     2      3     3   2.5
7     7     0     5      5     0   5  
8     8     0     6      2     2   4  
9     9     1    NA      1     3   3  

CodePudding user response:

If you have many groups and many columns, then I would recommend a more programmatic approach. You can define a code list code_ls where you define which columns should be used for which group numbers. Then we can subset this with dplyr::cur_group()$group and use it in an across statement to select those columns and wrap that into rowMeans(). Note that we use all_of() inside across() to select columns based on a character vector. Since your groups are numeric and we want to subset code_ls by name we wrap cur_group()$group into as.character.

library(dplyr)

code_ls <- list(`0` = c("first", "second"),
                `1` = c("first", "third"))

dat %>% 
  group_by(group) %>% 
  mutate(mean = rowMeans(across(
    all_of(code_ls[[as.character(cur_group()$group)]])
  ), na.rm = TRUE))

#> # A tibble: 9 x 5
#> # Groups:   group [2]
#>   group first second third  mean
#>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0     3      2     4   2.5
#> 2     0     0     NA     5   0  
#> 3     0     2      7     1   4.5
#> 4     1     3      1     6   4.5
#> 5     1     4      0    NA   4  
#> 6     1     2      3     3   2.5
#> 7     0     5      5     0   5  
#> 8     0     6      2     2   4  
#> 9     1    NA      1     3   3

# the data
dat <- tribble(
  
  ~group, ~first, ~second, ~third,
  
   0,   3,  2,  4, 
   0,   0,  NA, 5, 
   0,   2,  7,  1, 
   1,   3,  1,  6, 
   1,   4,  0,  NA,  
   1,   2,  3,  3, 
   0,   5,  5,  0, 
   0,   6,  2,  2, 
   1,   NA,  1, 3
)

Created on 2022-06-08 by the reprex package (v2.0.1)

  • Related