Home > Software engineering >  How to find the mean of multiple columns based on a second dataset?
How to find the mean of multiple columns based on a second dataset?

Time:03-24

Problem

I need to use a dictionary dataset to determine which columns from a different dataset I should calculate the mean.

Data

I will illustrate my case with the iris dataset (a dataset already in R).

I have two datasets:

  1. The actual data - like iris (however the col names is basically a1, a2, a3, a4...).
  2. The dictionary for the former indicates what each iris column indicates. The column feature in the dictionary_iris is the grouping variable, and it would also be part of the name for the new variables (e.g., the new variable would be called Sepal_mean, or Petal_mean).

Iris dataset

library(dplyr)

iris %>% as_tibble()

  # # A tibble: 150 x 5
  # Sepal.Length Sepal.Width Petal.Length Petal.Width Species
  # <dbl>       <dbl>        <dbl>       <dbl> <fct>  
  # 1          5.1         3.5          1.4         0.2 setosa 
  # 2          4.9         3            1.4         0.2 setosa 
  # 3          4.7         3.2          1.3         0.2 setosa 
  # 4          4.6         3.1          1.5         0.2 setosa 
  # 5          5           3.6          1.4         0.2 setosa 
  # 6          5.4         3.9          1.7         0.4 setosa 
  # 7          4.6         3.4          1.4         0.3 setosa 
  # 8          5           3.4          1.5         0.2 setosa 
  # 9          4.4         2.9          1.4         0.2 setosa 
  # 10          4.9         3.1          1.5         0.1 setosa 
  # # ... with 140 more rows

Dictionary Dataset

dictionary_iris <- tibble(variables = names(iris)) %>% 
  separate(variables, into = c("feature", "measure"), remove = FALSE)

dictionary_iris

# # A tibble: 5 x 3
# variables    feature measure
# <chr>        <chr>   <chr>  
# 1 Sepal.Length Sepal   Length 
# 2 Sepal.Width  Sepal   Width  
# 3 Petal.Length Petal   Length 
# 4 Petal.Width  Petal   Width  
# 5 Species      Species NA 

Expected Output

I know how to do that manually (see below), but I would like to automatize this process because I have a dataframe with more than 300 columns and want to take 23 different means across those columns.

library(dplyr)

iris %>% 
  rowwise() %>% 
  mutate(Sepal_mean = mean(c(Sepal.Length, Sepal.Width), na.rm = TRUE),
         Petal_mean = mean(c(Petal.Length, Petal.Width), na.rm = TRUE))


# # A tibble: 150 x 7
# # Rowwise: 
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal_mean Petal_mean
# <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>      <dbl>
# 1          5.1         3.5          1.4         0.2 setosa        4.3        0.8 
# 2          4.9         3            1.4         0.2 setosa        3.95       0.8 
# 3          4.7         3.2          1.3         0.2 setosa        3.95       0.75
# 4          4.6         3.1          1.5         0.2 setosa        3.85       0.85
# 5          5           3.6          1.4         0.2 setosa        4.3        0.8 
# 6          5.4         3.9          1.7         0.4 setosa        4.65       1.05
# 7          4.6         3.4          1.4         0.3 setosa        4          0.85
# 8          5           3.4          1.5         0.2 setosa        4.2        0.85
# 9          4.4         2.9          1.4         0.2 setosa        3.65       0.8 
# 10          4.9         3.1          1.5         0.1 setosa        4          0.8 
# # ... with 140 more rows

I have the impression that I can do that with dplyr::mutate() and dplyr::across() or with some dplyr::map() function. But I got quite lost.

CodePudding user response:

If the intention is to use 'feature' column as a grouping, then split the 'dictionary_iris' by the 'feature' column (removed the last row (-5) as it is not numeric column, loop over the list with imap, transmute to create the column in 'iris' with the rowMeans of those column name, and bind with the original data

library(dplyr)
library(purrr)
library(stringr)
out <- imap_dfc(split(dictionary_iris$variables[-5], 
        dictionary_iris$feature[-5]),
   ~ iris %>%
      transmute(!! str_c(.y, "_mean") := 
         rowMeans(across(all_of(.x)), na.rm = TRUE))) %>% 
    bind_cols(iris, .)

-output

> head(out)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal_mean Sepal_mean
1          5.1         3.5          1.4         0.2  setosa       0.80       4.30
2          4.9         3.0          1.4         0.2  setosa       0.80       3.95
3          4.7         3.2          1.3         0.2  setosa       0.75       3.95
4          4.6         3.1          1.5         0.2  setosa       0.85       3.85
5          5.0         3.6          1.4         0.2  setosa       0.80       4.30
6          5.4         3.9          1.7         0.4  setosa       1.05       4.65

CodePudding user response:

iris_means <- iris %>% 
  mutate(id = row_number()) %>% 
  pivot_longer(-c(id, Species)) %>% 
  mutate(name = gsub("\\..*", "", name)) %>% 
  group_by(id, name) %>% 
  summarise(val = mean(value)) %>% 
  ungroup %>%
  filter(name %in% !!dictionary_iris$feature) %>% 
  pivot_wider(-name, values_from = "val", names_glue = "{.name}_mean")

iris %>% 
  mutate(id = row_number()) %>% 
  left_join(iris_means) %>% 
  select(-id)


Joining, by = "id"
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species Petal_mean Sepal_mean
1            5.1         3.5          1.4         0.2     setosa       0.80       4.30
2            4.9         3.0          1.4         0.2     setosa       0.80       3.95
3            4.7         3.2          1.3         0.2     setosa       0.75       3.95
4            4.6         3.1          1.5         0.2     setosa       0.85       3.85
5            5.0         3.6          1.4         0.2     setosa       0.80       4.30
6            5.4         3.9          1.7         0.4     setosa       1.05       4.65
  • Related