I would love to have your inputs about how to use the purrr
package for multiple columns. In specific, I want to do some basic operations to create the confidence interval (lower and upper) for each of the variables mass and birth_year by skin_color, from the starwars database.
What I have done so far is:
- Calculate the number of observations different to NA foreach of the columns of my interest (mass and birth_year) by skin_color.
pacman::p_load("tidyr","purrr")
data <- starwars
data_obs <-
data %>%
dplyr::select(mass,birth_year,skin_color) %>%
dplyr::group_by(skin_color)%>%
dplyr::summarise_all(funs(sum(!is.na(.))))%>%
dplyr::ungroup()
- I created a database that estimates the mean and standard deviation foreach variable of interest by skin_color.
data_stats <-
data %>%
dplyr::select(mass,birth_year,skin_color)%>%
dplyr::group_by(skin_color) %>%
dplyr::summarise_all(., list(mean,sd)
, na.rm=T
)%>%
dplyr::ungroup()
- I merged both databases and in that way I have the number of observations different from NA, the mean, and the sd foreach of the columns.
data_complete <-
dplyr::inner_join(data_obs,data_stats, by="skin_color")
From here, it is easy to estimate the standard error foreach variable manually by:
data_complete <-
dplyr::mutate(mass_se = mass_sd/sqrt(mass_n),
mass_ci_upper = mass_mean qt(1 - (0.05 / 2), mass_n - 1)*mass_se,
mass_ci_lower = mass_mean - qt(1 - (0.05 / 2), mass_n - 1)*mass_se)
However, since this is a lot of work for my real dataset (with more than 50 variables), I would like to use the purrr
package. I have tried by doing:
list_vectors <- list(data$mass,data$birth_year)
list_ready <- map(list_vectors,
~ data %>%
group_by(skin_color)%>%
dplyr::summarise_all(funs(sum(!is.na(.))))%>%
dplyr::summarise_all(., list(mean,sd), na.rm=T) %>%
dplyr::ungroup()%>%
dplyr::mutate(var_se=var_sd/sqrt(var_n)))
vector_1 <- list_ready[[1]]
But this doesn't work. Any help is really really appreciated! (I really want to use the purrr
package).
CodePudding user response:
Simplest way might be to a) put your calculation steps into a function processing a vector and returning a list of a tibble
with the values you need and b) passing this into across
instead (using iris as an example instead):
library(tidyverse)
mean_ci <- function(vars) {
vars <- vars[!is.na(vars)]
mn <- mean(vars)
se <- sd(vars) / sqrt(length(vars))
tibble(
mean = mn,
lower = mn - qt(1 - (0.05 / 2), length(vars) - 1) * se,
upper = mn qt(1 - (0.05 / 2), length(vars) - 1) * se
)
}
iris |>
group_by(Species) |>
summarise(across(where(is.numeric), mean_ci)) |>
unnest(where(is_tibble), names_sep = "_")
#> # A tibble: 3 × 13
#> Species Sepal.Len…¹ Sepal…² Sepal…³ Sepal…⁴ Sepal…⁵ Sepal…⁶ Petal…⁷ Petal…⁸
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 4.91 5.11 3.43 3.32 3.54 1.46 1.41
#> 2 versicolor 5.94 5.79 6.08 2.77 2.68 2.86 4.26 4.13
#> 3 virginica 6.59 6.41 6.77 2.97 2.88 3.07 5.55 5.40
#> # … with 4 more variables: Petal.Length_upper <dbl>, Petal.Width_mean <dbl>,
#> # Petal.Width_lower <dbl>, Petal.Width_upper <dbl>, and abbreviated variable
#> # names ¹Sepal.Length_mean, ²Sepal.Length_lower, ³Sepal.Length_upper,
#> # ⁴Sepal.Width_mean, ⁵Sepal.Width_lower, ⁶Sepal.Width_upper,
#> # ⁷Petal.Length_mean, ⁸Petal.Length_lower
A more purrr
-y way to do it might be to map
the function to nested dataframes to create a slightly longer-form data output:
iris |>
nest(data = -Species) |>
mutate(data = map(data, ~tibble(metric = names(.x), map_df(.x, mean_ci)))) |>
unnest(data)
#> # A tibble: 12 × 5
#> Species metric mean lower upper
#> <fct> <chr> <dbl> <dbl> <dbl>
#> 1 setosa Sepal.Length 5.01 4.91 5.11
#> 2 setosa Sepal.Width 3.43 3.32 3.54
#> 3 setosa Petal.Length 1.46 1.41 1.51
#> 4 setosa Petal.Width 0.246 0.216 0.276
#> 5 versicolor Sepal.Length 5.94 5.79 6.08
#> 6 versicolor Sepal.Width 2.77 2.68 2.86
#> 7 versicolor Petal.Length 4.26 4.13 4.39
#> 8 versicolor Petal.Width 1.33 1.27 1.38
#> 9 virginica Sepal.Length 6.59 6.41 6.77
#> 10 virginica Sepal.Width 2.97 2.88 3.07
#> 11 virginica Petal.Length 5.55 5.40 5.71
#> 12 virginica Petal.Width 2.03 1.95 2.10