I want to create a new column containg means of specific columns. The selected columns should depend on the group.
group | first | second | third |
---|---|---|---|
0 | 3 | 2 | 4 |
0 | 0 | NA | 5 |
0 | 2 | 7 | 1 |
1 | 3 | 1 | 6 |
1 | 4 | 0 | NA |
1 | 2 | 3 | 3 |
0 | 5 | 5 | 0 |
0 | 6 | 2 | 2 |
1 | NA | 1 | 3 |
As an example: I want a mean column with the following conditions:
- if a row contains a "0" in group, the mean should be calculated from "first" and "second"
- if a row contains a "1" in group, the mean should be calculated from "first" and "third"
- if a cell contains NA it should be ignored
So the final dataframe should look something like this:
group | first | second | third | mean |
---|---|---|---|---|
0 | 3 | 2 | 4 | 2.5 |
0 | 0 | NA | 5 | 0 |
0 | 2 | 7 | 1 | 4.5 |
1 | 3 | 1 | 6 | 4.5 |
1 | 4 | 0 | NA | 4 |
1 | 2 | 3 | 3 | 2.5 |
0 | 5 | 5 | 0 | 5 |
0 | 6 | 2 | 2 | 4 |
1 | NA | 1 | 3 | 3 |
Since my dataframe contains over 50 variables (and a few thousand rows) and not just those I want the mean from I can't select specific columns by their column or row number (like c(2,5),). I was thinking about adding a condition that explains to R that it should calculate the mean from "first" and "second" only for those rows that have a "0" in group and then the same principle for group = 1. I have no idea how to combinde these conditions or how I can do this in several steps.
CodePudding user response:
library(tidyverse)
tribble(
~group, ~first, ~second, ~third,
0, 3, 2, 4,
0, 0, NA, 5,
0, 2, 7, 1,
1, 3, 1, 6,
1, 4, 0, NA,
1, 2, 3, 3,
0, 5, 5, 0,
0, 6, 2, 2,
1, NA, 1, 3
) |>
rowwise() |>
mutate(mean = if_else(group == 0, mean(c_across(c(first, second)), na.rm = TRUE),
mean(c_across(c(first, third)), na.rm = TRUE)))
#> # A tibble: 9 × 5
#> # Rowwise:
#> group first second third mean
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 3 2 4 2.5
#> 2 0 0 NA 5 0
#> 3 0 2 7 1 4.5
#> 4 1 3 1 6 4.5
#> 5 1 4 0 NA 4
#> 6 1 2 3 3 2.5
#> 7 0 5 5 0 5
#> 8 0 6 2 2 4
#> 9 1 NA 1 3 3
Created on 2022-06-08 by the reprex package (v2.0.1)
CodePudding user response:
One way to do this would be to pivot the data to long format and use case_when()
to add a weight variable of 0
(for values you want ignored) and 1
(for values you want included) according to your conditions. Use weighted.mean()
to calculate your mean and pivot back to wide.
library(tidyr)
library(dplyr)
df %>%
rowid_to_column() %>%
pivot_longer(-c(rowid, group)) %>%
mutate(weight = case_when(group == 0 & name == "third" ~ 0,
group == 1 & name == "second" ~ 0,
TRUE ~ 1)) %>%
group_by(rowid) %>%
mutate(mean = weighted.mean(value, weight, na.rm = TRUE)) %>%
pivot_wider(-weight) %>%
ungroup() %>%
relocate(mean, .after = last_col())
# A tibble: 9 × 6
rowid group first second third mean
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 3 2 4 2.5
2 2 0 0 NA 5 0
3 3 0 2 7 1 4.5
4 4 1 3 1 6 4.5
5 5 1 4 0 NA 4
6 6 1 2 3 3 2.5
7 7 0 5 5 0 5
8 8 0 6 2 2 4
9 9 1 NA 1 3 3
CodePudding user response:
If you have many groups and many columns, then I would recommend a more programmatic approach. You can define a code list code_ls
where you define which columns should be used for which group numbers. Then we can subset this with dplyr::cur_group()$group
and use it in an across
statement to select those columns and wrap that into rowMeans()
. Note that we use all_of()
inside across()
to select columns based on a character vector. Since your groups are numeric and we want to subset code_ls
by name we wrap cur_group()$group
into as.character
.
library(dplyr)
code_ls <- list(`0` = c("first", "second"),
`1` = c("first", "third"))
dat %>%
group_by(group) %>%
mutate(mean = rowMeans(across(
all_of(code_ls[[as.character(cur_group()$group)]])
), na.rm = TRUE))
#> # A tibble: 9 x 5
#> # Groups: group [2]
#> group first second third mean
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 3 2 4 2.5
#> 2 0 0 NA 5 0
#> 3 0 2 7 1 4.5
#> 4 1 3 1 6 4.5
#> 5 1 4 0 NA 4
#> 6 1 2 3 3 2.5
#> 7 0 5 5 0 5
#> 8 0 6 2 2 4
#> 9 1 NA 1 3 3
# the data
dat <- tribble(
~group, ~first, ~second, ~third,
0, 3, 2, 4,
0, 0, NA, 5,
0, 2, 7, 1,
1, 3, 1, 6,
1, 4, 0, NA,
1, 2, 3, 3,
0, 5, 5, 0,
0, 6, 2, 2,
1, NA, 1, 3
)
Created on 2022-06-08 by the reprex package (v2.0.1)