The output of a survey from Qualtrics codes question choices where multiple responses can be recorded such as race/ethnicity demographics (ex. below), in a manner that I can't come up with an easy solution to analyse. It records the chosen checkboxes per row under each choice (which is in its own column), and unchosen choices remain blank. I have decided that a good place to start would be to count the non-"NAs" for each choice. However, it is not working out the way I had planned and a rigorous scouring of available solutions has been less than useful. I figured out a way to get column counts using apply, but the output is still a bit clumsy to deal with. I have a dataframe with many columns that will need to be analysed in this way, so I am using grep function to select the relevant columns that require choice-counts.
data:
structure(list(race_White = c("White", NA, NA, "White", NA, NA,
"White", "White", NA, "White", "White", "White", "White", "White",
"White", "White", "White", "White", "White", "White", NA, "White",
"White", "White", NA), `race_Black or African American` = c(NA,
NA, "Black or African American", NA, NA, "Black or African American",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, "Black or African American"), `race_American Indian or Alaska Native` = c(NA,
NA, NA, NA, NA, NA, NA, NA, "American Indian or Alaska Native",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), race_Asian = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, "Asian", NA, NA, NA, NA),
`race_Middle Eastern or North African` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), `race_Hispanic, Latino or Spanish` = c(NA, "Hispanic, Latino or Spanish",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), `race_Native Hawaiian or Pacific Islander` = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), `race_ Prefer not to share` = c(NA, NA, NA, NA, "Prefer not to share",
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA), race_Other = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), education_level = structure(c(3L,
2L, 5L, 4L, 6L, 3L, 6L, 2L, 3L, 3L, 5L, 2L, 5L, 5L, 3L, 3L,
5L, 2L, 5L, 5L, 5L, 3L, 3L, 3L, 5L), .Label = c("Less than high school degree",
"High school graduate (high school diploma or equivalent)",
"Some college but no degree", "Associate's degree (2-year)",
"Bachelor's degree (4-year)", "Master's degree", "Doctoral/Professional degree (PhD, MD, JD)",
"Other/Prefer not to share"), class = "factor"), age = c(74,
43, NA, 37, 61, 64, NA, NA, 45, NA, NA, 21, NA, NA, 52, 43,
43, NA, 65, 42, NA, 27, 35, NA, 46)), row.names = c(NA, -25L
), class = c("tbl_df", "tbl", "data.frame"))
I have used grep to selct the column numbers I would like to count the choices for by using the following:
race<-c(grep("race", colnames(data)))
Then, I also used the column names in case the formula required names instead of numbers
racenames<-colnames(data[race])
After I created these selections I tried to get some sort of table of counts of the rows not equal to "" using the following, (which did not work out)
racecounts <- sapply(data[race],FUN = function(x){length(x[x!=""])})
racecounts
This basically sums up each row in the column, not the non-empty ones that I hoped it would. So I tried just a simple apply function and it did work:
racecounts2 <- apply(data[race], 2, table)
racecounts2
This works and then I have to convert it to a prop.table to get the proportions to use with kable
racecounts2<-prop.table(racecounts2)
racecounts2%>%
kbl() %>%
kable_material_dark()
I am just curious if anyone has found alternate/better ways to deal with this data format? I'm willing to try anything different, this one just seems clumsy and the outputs from it leave a bit to the imagination. It would be nice to find a way to deal with these data that allow ranking/plotting etc. to be carried out easier going forward.
So I'm just curious as to how the community would have done it.
CodePudding user response:
You can count the number of non-NA values using !is.na
for the race columns like this:
colSums(!is.na(data[race]))
or, using dplyr
grammar and tidyr::pivot_longer
to make it more look like a table:
data %>% select(starts_with("race")) %>%
summarise(across(everything(), ~sum(!is.na(.x)))) %>%
pivot_longer(cols=everything(), names_to = "race", values_to = "count",
names_transform = list(race = \(x) str_remove(x, "race_")))
# A tibble: 9 x 2
race count
<chr> <int>
1 "White" 18
2 "Black or African American" 3
3 "American Indian or Alaska Native" 1
4 "Asian" 1
5 "Middle Eastern or North African" 0
6 "Hispanic, Latino or Spanish" 1
7 "Native Hawaiian or Pacific Islander" 0
8 " Prefer not to share" 1
9 "Other" 0