I have the following dataframe format in R
question_id | answer_id | question_text | answer_text |
---|---|---|---|
10000 | 0001 | how many people are alive? | 20000 |
10000 | 0002 | how many people are alive? | 50000 |
10000 | 0003 | how many people are alive? | 60000 |
10000 | 0004 | how many people are alive? | 900000 |
20000 | 0021 | what is the meaning of life? | yes |
20000 | 0072 | what is the meaning of life? | no |
20000 | 0083 | what is the meaning of life? | maybe |
20000 | 0094 | what is the meaning of life? | ok |
20000 | 0097 | what is the meaning of life? | indifference |
I want it in the following format:
question_id | question_text | answer_text_1 | answer_text_2 | answer_text_3 | answer_text_4 | ... | answer_text_n |
---|---|---|---|---|---|---|---|
10000 | how many people are alive? | 20000 | 50000 | 60000 | 900000 | ... | NA |
20000 | what is the meaning of life? | yes | no | maybe | ok | ... | indifference |
So as you can see I want the question_id, then the question_text itself and then a set of column which equal to the maximum amount of answers there can be to a question. So some questions are true or false so would only have 2 columns filled in. But also multiple choice questions can exist, which might have 7 different options to choose from. I want it to be adaptive.
The only thing I could think of was tidyr::pivot_wider()
. I can't seem to make this work. Any help would be greatly appreciated. Thanks!!
========== edit ===========
the code that I tried
qa_columns <- function(qa_df, question_list){
df <- qa_df %>%
dplyr::filter(question_id %in% question_list) %>%
tidyr::pivot_wider(values_from = answer_text)
return(df)
}
qa <- qa_columns(qa_text, question_list = question_list)
CodePudding user response:
Try adding a row-number per group, then keeping question_text
as an id:
library(dplyr)
library(tidyr) # pivot_wider
dat %>%
group_by(question_id) %>%
mutate(rn = row_number()) %>%
ungroup() %>%
pivot_wider(c("question_id", "question_text"), names_from = "rn", names_prefix = "answer_text_", values_from = "answer_text")
# # A tibble: 2 x 7
# question_id question_text answer_text_1 answer_text_2 answer_text_3 answer_text_4 answer_text_5
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 10000 how many people are alive? 20000 50000 60000 900000 <NA>
# 2 20000 what is the meaning of life? yes no maybe ok indifference
Note that this coerces numbers to character
in your answer_text_#
columns, generally unavoidable.
CodePudding user response:
Using dcast
from data.table
library(data.table)
dcast(setDT(dat), question_id question_text ~
paste0("answer_text_", rowid(question_id)), value.var = "answer_text")
-output
question_id question_text answer_text_1 answer_text_2 answer_text_3 answer_text_4 answer_text_5
1: 10000 how many people are alive? 20000 50000 60000 900000 <NA>
2: 20000 what is the meaning of life? yes no maybe ok indifference
data
dat <- structure(list(question_id = c(10000L, 10000L, 10000L, 10000L,
20000L, 20000L, 20000L, 20000L, 20000L), answer_id = c(1L, 2L,
3L, 4L, 21L, 72L, 83L, 94L, 97L), question_text = c("how many people are alive?",
"how many people are alive?", "how many people are alive?", "how many people are alive?",
"what is the meaning of life?", "what is the meaning of life?",
"what is the meaning of life?", "what is the meaning of life?",
"what is the meaning of life?"), answer_text = c("20000", "50000",
"60000", "900000", "yes", "no", "maybe", "ok", "indifference"
)), class = "data.frame", row.names = c(NA, -9L))