Home > Net >  How do I use Tidyverse and Pivot_Wider to create a Summary Table at Two Levels of Aggregation?
How do I use Tidyverse and Pivot_Wider to create a Summary Table at Two Levels of Aggregation?

Time:12-06

I have a large data set and would like to create a robust summary table that summarizes both (1) at the two levels of a factor variable (ExamType) but then also (2) at the disaggregated levels for the four years included in the data (Year) within each ExamType. Summary statistics desired include mean, sd, and n relative to ExamScore across multiple different other covariates, including all levels of Region, both levels of Gender, and all levels of Race.

I have difficulty with summarizing at both levels of aggregation as well as understanding how to use pivot wider for more than one variable (I tried and was only able to get a simple summary table at one level of aggregation across the levels of one covariate).

Here is what I'm hoping the table to look like:

|          | North    | South   | East     | West    | Male    | Female   | White  | etc.
| -------- | -------- |-------- | -------- |-------- | --------| -------- |--------|-etc.
| Exam A   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2019   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2020   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2021   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2022   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
| Exam B   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2019   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2020   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2021   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|
|   2022   | M |SD| n | M |SD| n| M |SD| n | M |SD| n| M |SD| n| M |SD| n | M |SD|n|

(Sorry ran out of room.)

Here is some sample data:

structure(list(Year = c(2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 
2019L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2021L, 
2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2021L, 2022L, 2022L, 
2022L, 2022L, 2022L, 2022L, 2022L, 2022L), ExamType = c("A", 
"B", "A", "A", "B", "A", "B", "B", "B", "A", "B", "B", "A", "B", 
"B", "B", "A", "A", "A", "B", "B", "A", "B", "B", "A", "A", "A", 
"B", "A", "B"), ExamScore = c(1L, 2L, 2L, 3L, 1L, 4L, 4L, 5L, 
2L, 1L, 4L, 3L, 2L, 5L, 1L, 4L, 3L, 2L, 1L, 2L, 5L, 4L, 4L, 3L, 
1L, 2L, 5L, 4L, 3L, 1L), Region = c("North", "South", "East", 
"East", "North", "South", "West", "East", "South", "South", "West", 
"East", "North", "South", "West", "East", "North", "South", "West", 
"East", "North", "West", "West", "East", "North", "South", "West", 
"East", "West", "North"), Gender = c("M", "M", "F", "F", "M", 
"F", "F", "F", "M", "M", "M", "M", "M", "M", "F", "F", "M", "F", 
"M", "F", "F", "M", "F", "F", "F", "M", "M", "F", "M", "M"), 
    Race = c("White", "Black", "Other", "Other", "Latinx", "White", 
    "Latinx", "Black", "Other", "White", "Latinx", "Black", "Other", 
    "White", "Latinx", "Black", "Other", "White", "Latinx", "Latinx", 
    "Black", "Other", "White", "White", "Black", "Latinx", "White", 
    "Black", "Latinx", "Other")), class = "data.frame", row.names = c(NA, 
-30L))

Thank you!

CodePudding user response:

You can try the gtsummary package. Its not yet perfect, but a good starting point.

library(gtsummary)
library(tidyverse)
df %>%
  as_tibble() %>% 
  select(Year, ExamType, Region, Gender) %>% 
  unite(ID, ExamType, Year) %>% 
  tbl_strata(c("Region", "Gender") ,.tbl_fun = ~ .x %>% tbl_summary())

CodePudding user response:

Here's a simply tidyverse ay using pivot_wider

library(tidyverse)
df |> 
  group_by(ExamType, Year, Region) |> 
  summarize(mean = mean(ExamScore),
            SD   = sd(ExamScore),
            n    = n()) |> 
  pivot_wider(names_from = Region,
              values_from = c(mean, SD, n),
              names_vary = "slowest")

which gives:

# A tibble: 8 × 14
# Groups:   ExamType, Year [8]
  ExamType  Year mean_East SD_East n_East mean_North SD_North n_North mean_South SD_South n_South mean_West SD_West n_West
  <chr>    <int>     <dbl>   <dbl>  <int>      <dbl>    <dbl>   <int>      <dbl>    <dbl>   <int>     <dbl>   <dbl>  <int>
1 A         2019       2.5   0.707      2          1       NA       1        4      NA          1      NA     NA        NA
2 A         2020      NA    NA         NA          2       NA       1        1      NA          1      NA     NA        NA
3 A         2021      NA    NA         NA          3       NA       1        2      NA          1       2.5    2.12      2
4 A         2022      NA    NA         NA          1       NA       1        2      NA          1       4      1.41      2
5 B         2019      NA    NA         NA          1       NA       1        2      NA          1       4     NA         1
6 B         2020       4     1.41       2         NA       NA      NA        3.5     2.12       2       4     NA         1
7 B         2021       3     1.41       2          5       NA       1       NA      NA         NA       1     NA         1
8 B         2022       3.5   0.707      2          1       NA       1       NA      NA         NA       4     NA         1

Of course, the names are not exactly what you provided above. But the content should be.


Update:

you can add an aggregated row per Exam by using add_row and basically repeat the whole step as before by not also grouping by Year. As a last step, you need to arrange so that the NA years (which carry the aggregation per ExamType) appear first.

df |> 
  group_by(ExamType, Year, Region) |> 
  summarize(mean = mean(ExamScore),
            SD   = sd(ExamScore),
            n    = n()) |> 
  ungroup() |> 
  pivot_wider(names_from = Region,
              values_from = c(mean, SD, n),
              names_vary = "slowest") |> 
  add_row(df |> 
            group_by(ExamType, Region) |> 
            summarize(mean = mean(ExamScore),
                      SD   = sd(ExamScore),
                      n    = n()) |> 
            ungroup() |> 
            pivot_wider(names_from = Region,
                        values_from = c(mean, SD, n),
                        names_vary = "slowest")) |> 
  arrange(ExamType, match(Year, str_sort(Year, na_last = FALSE)))

# A tibble: 10 × 14
   ExamType  Year mean_East SD_East n_East mean_North SD_North n_North mean_South SD_South n_South mean_West SD_West n_West
   <chr>    <int>     <dbl>   <dbl>  <int>      <dbl>    <dbl>   <int>      <dbl>    <dbl>   <int>     <dbl>   <dbl>  <int>
 1 A           NA       2.5   0.707      2       1.75    0.957       4       2.25     1.26       4      3.25    1.71      4
 2 A         2019       2.5   0.707      2       1      NA           1       4       NA          1     NA      NA        NA
 3 A         2020      NA    NA         NA       2      NA           1       1       NA          1     NA      NA        NA
 4 A         2021      NA    NA         NA       3      NA           1       2       NA          1      2.5     2.12      2
 5 A         2022      NA    NA         NA       1      NA           1       2       NA          1      4       1.41      2
 6 B           NA       3.5   1.05       6       2.33    2.31        3       3        1.73       3      3.25    1.5       4
 7 B         2019      NA    NA         NA       1      NA           1       2       NA          1      4      NA         1
 8 B         2020       4     1.41       2      NA      NA          NA       3.5      2.12       2      4      NA         1
 9 B         2021       3     1.41       2       5      NA           1      NA       NA         NA      1      NA         1
10 B         2022       3.5   0.707      2       1      NA           1      NA       NA         NA      4      NA         1
  •  Tags:  
  • r
  • Related