Home > front end >  R Tidyverse: how to create a weighted count table of multiple factor variables
R Tidyverse: how to create a weighted count table of multiple factor variables

Time:11-04

This question relates to summarising survey results

The case is where there are multiple variants of the same question so

df <- data.frame(
    survey_question_subquestionA = c("Less","More","Less","More","Missing","More",Less") 
    survey_question_subquestionB = c("Less","More","Less","More","Missing","More",Less")
    survey_question_subquestionC = c("Less","More","Less","More","Missing","More",Less")
    survey_weights = c( 1.02, 1.05, 0.97, 0.92, 0.94, 1.03, 0.98 )

The objective is to end up with a table

More Less Missing
subquestionA n n n
subquestionB n n n
subquestionC n n n
Total n n n
qn <- "Question"
wt <- "Survey_Weights"
tvdf |> filter(Question == qn) |> select(Variable) |> unlist() |> unname() -> CVL

In real life I extract the variable names from another dataframe containing meta information called tvdf and the data frame itself has lots of other questions and surveys in it.

This works for unweighted counts

df |> select(any_of(CVL)) |> pivot_longer(cols = contains(qn), names_to = "Variable", values_to = "Values") |> group_by(Variable) |> count(Values) |> pivot_wider(names_from = Values, values_from = n) -> table

the problem is

count(Values, wt = get(wt)) 

doesnt work because they were excluded at select stage. If I include them they end up in my table.

Any suggestions welcome. I'm open to data.table solutions too.

CodePudding user response:

Is this what you're looking for? For the addition of the "Total" row, I used the janitor package which has a handy function adorn_totals().

library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.2.1
#> Warning: package 'tibble' was built under R version 4.2.1
#> Warning: package 'tidyr' was built under R version 4.2.1
#> Warning: package 'readr' was built under R version 4.2.1
#> Warning: package 'purrr' was built under R version 4.2.1
#> Warning: package 'dplyr' was built under R version 4.2.1
#> Warning: package 'stringr' was built under R version 4.2.1
#> Warning: package 'forcats' was built under R version 4.2.1

df <- tibble(
  survey_question_subquestionA = c("Less",
                                   "More",
                                   "Less",
                                   "More",
                                   "Missing",
                                   "More",
                                   "Less"),
  survey_question_subquestionB = c("Less", "More", "Less", "More", "Missing", "More", "Less"),
  survey_question_subquestionC = c("Less",
                                   "More",
                                   "Less",
                                   "More",
                                   "Missing",
                                   "More",
                                   "Less"),
  survey_weights = c(1.02, 1.05, 0.97, 0.92, 0.94, 1.03, 0.98)
)

df |>
  pivot_longer(cols = 1:3,
               names_to = "question") |>
  count(question, value, wt = survey_weights) |>
  pivot_wider(names_from = value,
              values_from = n) |> 
  janitor::adorn_totals()
#>                      question Less Missing More
#>  survey_question_subquestionA 2.97    0.94    3
#>  survey_question_subquestionB 2.97    0.94    3
#>  survey_question_subquestionC 2.97    0.94    3
#>                         Total 8.91    2.82    9

Created on 2022-11-04 with reprex v2.0.2

  • Related