Home > database >  Calculate deviations from group means for each column
Calculate deviations from group means for each column

Time:08-26

I am working on a big dataset with a lot of rows and columns. It has approximately the following structure:

myData <- structure(list(ID = c("a", "b", "c", "d", "e"),
group = c("x", "y", "x", "x", "y"),
var1 = c(0L, 0L, 2L, 0L, 0L), var2 = c(0, 
0, 1, 0, 1), var3 = c(1.032, 0, 0, 0, 0)), row.names = c(NA, 
5L), class = "data.frame")
> myData
  ID group var1 var2  var3
1  a     x    0    0 1.032
2  b     y    0    0 0.000
3  c     x    2    1 0.000
4  d     x    0    0 0.000
5  e     y    0    1 0.000

I need to create two variables for each variable varN: varN_mean, containing the mean of each group, and varN_dev, containing the deviation from the mean of the individual observation from each group. The final data should look approximately like this:

> myData
  ID group var1_mean  var1_dev var2_mean var2_dev var3_mean var3_dev
1  a     x     0.667   -0.667     0.333    -0.333     0.344    0.688
2  b     y         0        0       0.5      -0.5         0        0
3  c     x     0.667    1.333     0.333     0.667     0.344   -0.344
4  d     x     0.667   -0.667     0.333    -0.333     0.344   -0.344
5  e     y         0        0       0.5       0.5         0        0

The order of the columns is not important.

I got until here (with help from a colleague) with dplyr:

library(dplyr)

means_myData <- 
    myData %>%
      group_by(group) %>%
      summarise(
        ID,
        across(starts_with("var"),
          .fns=~mean(.x),
          .names="{col}_mean")
      ) %>%
      ungroup()

But now I am struggling to compute the deviations. I think I should something like this:

means_myData %>%
  full_join(myData, by = c("ID", "group")) %>%
  summarise(
    ID,
    across(matches("^var._mean$"), .fns-~.x - ...)
  )

but I am not sure how to match the columns with their respective means in the expression.

How can I create the variables automatically for the deviations and compute the deviations from the mean for each group?

Thanks!

CodePudding user response:

Use across's list option to compute both variables at once:

library(dplyr)
myData %>% 
  group_by(group) %>% 
  mutate(across(starts_with("var"), 
                list(mean = mean, dev = ~ .x - mean(.x)), 
                .names = "{.col}_{.fn}"),
         .keep = "unused")

  ID    group var1_mean var1_dev var2_mean var2_dev var3_mean var3_dev
  <chr> <chr>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>    <dbl>
1 a     x         0.667   -0.667     0.333   -0.333     0.344    0.688
2 b     y         0        0         0.5     -0.5       0        0    
3 c     x         0.667    1.33      0.333    0.667     0.344   -0.344
4 d     x         0.667   -0.667     0.333   -0.333     0.344   -0.344
5 e     y         0        0         0.5      0.5       0        0    
  • Related