Problem
I need to use a dictionary dataset to determine which columns from a different dataset I should calculate the mean.
Data
I will illustrate my case with the iris
dataset (a dataset already in R).
I have two datasets:
- The actual data - like
iris
(however the col names is basically a1, a2, a3, a4...). - The dictionary for the former indicates what each
iris
column indicates. The columnfeature
in thedictionary_iris
is the grouping variable, and it would also be part of the name for the new variables (e.g., the new variable would be calledSepal_mean
, orPetal_mean
).
Iris dataset
library(dplyr)
iris %>% as_tibble()
# # A tibble: 150 x 5
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# <dbl> <dbl> <dbl> <dbl> <fct>
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
# 7 4.6 3.4 1.4 0.3 setosa
# 8 5 3.4 1.5 0.2 setosa
# 9 4.4 2.9 1.4 0.2 setosa
# 10 4.9 3.1 1.5 0.1 setosa
# # ... with 140 more rows
Dictionary Dataset
dictionary_iris <- tibble(variables = names(iris)) %>%
separate(variables, into = c("feature", "measure"), remove = FALSE)
dictionary_iris
# # A tibble: 5 x 3
# variables feature measure
# <chr> <chr> <chr>
# 1 Sepal.Length Sepal Length
# 2 Sepal.Width Sepal Width
# 3 Petal.Length Petal Length
# 4 Petal.Width Petal Width
# 5 Species Species NA
Expected Output
I know how to do that manually (see below), but I would like to automatize this process because I have a dataframe with more than 300 columns and want to take 23 different means across those columns.
library(dplyr)
iris %>%
rowwise() %>%
mutate(Sepal_mean = mean(c(Sepal.Length, Sepal.Width), na.rm = TRUE),
Petal_mean = mean(c(Petal.Length, Petal.Width), na.rm = TRUE))
# # A tibble: 150 x 7
# # Rowwise:
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_mean Petal_mean
# <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
# 1 5.1 3.5 1.4 0.2 setosa 4.3 0.8
# 2 4.9 3 1.4 0.2 setosa 3.95 0.8
# 3 4.7 3.2 1.3 0.2 setosa 3.95 0.75
# 4 4.6 3.1 1.5 0.2 setosa 3.85 0.85
# 5 5 3.6 1.4 0.2 setosa 4.3 0.8
# 6 5.4 3.9 1.7 0.4 setosa 4.65 1.05
# 7 4.6 3.4 1.4 0.3 setosa 4 0.85
# 8 5 3.4 1.5 0.2 setosa 4.2 0.85
# 9 4.4 2.9 1.4 0.2 setosa 3.65 0.8
# 10 4.9 3.1 1.5 0.1 setosa 4 0.8
# # ... with 140 more rows
I have the impression that I can do that with dplyr::mutate()
and dplyr::across()
or with some dplyr::map()
function. But I got quite lost.
CodePudding user response:
If the intention is to use 'feature' column as a grouping, then split
the 'dictionary_iris' by the 'feature' column (removed the last row (-5
) as it is not numeric column, loop over the list
with imap
, transmute
to create the column in 'iris' with the rowMeans
of those column name, and bind with the original data
library(dplyr)
library(purrr)
library(stringr)
out <- imap_dfc(split(dictionary_iris$variables[-5],
dictionary_iris$feature[-5]),
~ iris %>%
transmute(!! str_c(.y, "_mean") :=
rowMeans(across(all_of(.x)), na.rm = TRUE))) %>%
bind_cols(iris, .)
-output
> head(out)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_mean Sepal_mean
1 5.1 3.5 1.4 0.2 setosa 0.80 4.30
2 4.9 3.0 1.4 0.2 setosa 0.80 3.95
3 4.7 3.2 1.3 0.2 setosa 0.75 3.95
4 4.6 3.1 1.5 0.2 setosa 0.85 3.85
5 5.0 3.6 1.4 0.2 setosa 0.80 4.30
6 5.4 3.9 1.7 0.4 setosa 1.05 4.65
CodePudding user response:
iris_means <- iris %>%
mutate(id = row_number()) %>%
pivot_longer(-c(id, Species)) %>%
mutate(name = gsub("\\..*", "", name)) %>%
group_by(id, name) %>%
summarise(val = mean(value)) %>%
ungroup %>%
filter(name %in% !!dictionary_iris$feature) %>%
pivot_wider(-name, values_from = "val", names_glue = "{.name}_mean")
iris %>%
mutate(id = row_number()) %>%
left_join(iris_means) %>%
select(-id)
Joining, by = "id"
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_mean Sepal_mean
1 5.1 3.5 1.4 0.2 setosa 0.80 4.30
2 4.9 3.0 1.4 0.2 setosa 0.80 3.95
3 4.7 3.2 1.3 0.2 setosa 0.75 3.95
4 4.6 3.1 1.5 0.2 setosa 0.85 3.85
5 5.0 3.6 1.4 0.2 setosa 0.80 4.30
6 5.4 3.9 1.7 0.4 setosa 1.05 4.65