Home > database >  How to create confidence intervals for multiple columns using purrr and dplyr packages in R?
How to create confidence intervals for multiple columns using purrr and dplyr packages in R?

Time:12-12

I would love to have your inputs about how to use the purrr package for multiple columns. In specific, I want to do some basic operations to create the confidence interval (lower and upper) for each of the variables mass and birth_year by skin_color, from the starwars database.

What I have done so far is:

  1. Calculate the number of observations different to NA foreach of the columns of my interest (mass and birth_year) by skin_color.
pacman::p_load("tidyr","purrr")

data <- starwars
data_obs <- 
  data %>%
  dplyr::select(mass,birth_year,skin_color) %>%
  dplyr::group_by(skin_color)%>%
  dplyr::summarise_all(funs(sum(!is.na(.))))%>%
  dplyr::ungroup()
  1. I created a database that estimates the mean and standard deviation foreach variable of interest by skin_color.
data_stats <- 
  data %>%
  dplyr::select(mass,birth_year,skin_color)%>%
  dplyr::group_by(skin_color) %>%
  dplyr::summarise_all(., list(mean,sd)
                       , na.rm=T
                       )%>%
  dplyr::ungroup()
  1. I merged both databases and in that way I have the number of observations different from NA, the mean, and the sd foreach of the columns.
data_complete <-
  dplyr::inner_join(data_obs,data_stats, by="skin_color")

From here, it is easy to estimate the standard error foreach variable manually by:

data_complete <-
  dplyr::mutate(mass_se = mass_sd/sqrt(mass_n), 
                mass_ci_upper = mass_mean   qt(1 - (0.05 / 2), mass_n - 1)*mass_se,
                mass_ci_lower = mass_mean - qt(1 - (0.05 / 2), mass_n - 1)*mass_se)

However, since this is a lot of work for my real dataset (with more than 50 variables), I would like to use the purrr package. I have tried by doing:

list_vectors <- list(data$mass,data$birth_year)
list_ready <- map(list_vectors, 
                  ~ data %>%
                    group_by(skin_color)%>%
                    dplyr::summarise_all(funs(sum(!is.na(.))))%>%
                    dplyr::summarise_all(., list(mean,sd), na.rm=T) %>%
                    dplyr::ungroup()%>%
                    dplyr::mutate(var_se=var_sd/sqrt(var_n))) 

vector_1 <- list_ready[[1]]

But this doesn't work. Any help is really really appreciated! (I really want to use the purrr package).

CodePudding user response:

Simplest way might be to a) put your calculation steps into a function processing a vector and returning a list of a tibble with the values you need and b) passing this into across instead (using iris as an example instead):

library(tidyverse)

mean_ci <- function(vars) {
  vars <- vars[!is.na(vars)]
  mn <- mean(vars)
  se <- sd(vars) / sqrt(length(vars))
  tibble(
    mean = mn,
    lower = mn - qt(1 - (0.05 / 2), length(vars) - 1) * se,
    upper = mn   qt(1 - (0.05 / 2), length(vars) - 1) * se
  )
}

iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), mean_ci)) |> 
  unnest(where(is_tibble), names_sep = "_")

#> # A tibble: 3 × 13
#>   Species    Sepal.Len…¹ Sepal…² Sepal…³ Sepal…⁴ Sepal…⁵ Sepal…⁶ Petal…⁷ Petal…⁸
#>   <fct>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1 setosa            5.01    4.91    5.11    3.43    3.32    3.54    1.46    1.41
#> 2 versicolor        5.94    5.79    6.08    2.77    2.68    2.86    4.26    4.13
#> 3 virginica         6.59    6.41    6.77    2.97    2.88    3.07    5.55    5.40
#> # … with 4 more variables: Petal.Length_upper <dbl>, Petal.Width_mean <dbl>,
#> #   Petal.Width_lower <dbl>, Petal.Width_upper <dbl>, and abbreviated variable
#> #   names ¹​Sepal.Length_mean, ²​Sepal.Length_lower, ³​Sepal.Length_upper,
#> #   ⁴​Sepal.Width_mean, ⁵​Sepal.Width_lower, ⁶​Sepal.Width_upper,
#> #   ⁷​Petal.Length_mean, ⁸​Petal.Length_lower

A more purrr-y way to do it might be to map the function to nested dataframes to create a slightly longer-form data output:

iris |> 
  nest(data = -Species) |> 
  mutate(data = map(data, ~tibble(metric = names(.x), map_df(.x, mean_ci)))) |> 
  unnest(data) 

#> # A tibble: 12 × 5
#>    Species    metric        mean lower upper
#>    <fct>      <chr>        <dbl> <dbl> <dbl>
#>  1 setosa     Sepal.Length 5.01  4.91  5.11 
#>  2 setosa     Sepal.Width  3.43  3.32  3.54 
#>  3 setosa     Petal.Length 1.46  1.41  1.51 
#>  4 setosa     Petal.Width  0.246 0.216 0.276
#>  5 versicolor Sepal.Length 5.94  5.79  6.08 
#>  6 versicolor Sepal.Width  2.77  2.68  2.86 
#>  7 versicolor Petal.Length 4.26  4.13  4.39 
#>  8 versicolor Petal.Width  1.33  1.27  1.38 
#>  9 virginica  Sepal.Length 6.59  6.41  6.77 
#> 10 virginica  Sepal.Width  2.97  2.88  3.07 
#> 11 virginica  Petal.Length 5.55  5.40  5.71 
#> 12 virginica  Petal.Width  2.03  1.95  2.10
  • Related