Home > Net >  Reshape dataframe from long to wide but amount of columns different for every row
Reshape dataframe from long to wide but amount of columns different for every row

Time:10-20

I have the following dataframe format in R

question_id answer_id question_text answer_text
10000 0001 how many people are alive? 20000
10000 0002 how many people are alive? 50000
10000 0003 how many people are alive? 60000
10000 0004 how many people are alive? 900000
20000 0021 what is the meaning of life? yes
20000 0072 what is the meaning of life? no
20000 0083 what is the meaning of life? maybe
20000 0094 what is the meaning of life? ok
20000 0097 what is the meaning of life? indifference

I want it in the following format:

question_id question_text answer_text_1 answer_text_2 answer_text_3 answer_text_4 ... answer_text_n
10000 how many people are alive? 20000 50000 60000 900000 ... NA
20000 what is the meaning of life? yes no maybe ok ... indifference

So as you can see I want the question_id, then the question_text itself and then a set of column which equal to the maximum amount of answers there can be to a question. So some questions are true or false so would only have 2 columns filled in. But also multiple choice questions can exist, which might have 7 different options to choose from. I want it to be adaptive.

The only thing I could think of was tidyr::pivot_wider(). I can't seem to make this work. Any help would be greatly appreciated. Thanks!!

========== edit ===========

the code that I tried

qa_columns <- function(qa_df, question_list){
  df <- qa_df %>%
    dplyr::filter(question_id %in% question_list) %>%
    tidyr::pivot_wider(values_from = answer_text)
  return(df)
}

qa <- qa_columns(qa_text, question_list = question_list)

CodePudding user response:

Try adding a row-number per group, then keeping question_text as an id:

library(dplyr)
library(tidyr) # pivot_wider
dat %>%
  group_by(question_id) %>%
  mutate(rn = row_number()) %>%
  ungroup() %>%
  pivot_wider(c("question_id", "question_text"), names_from = "rn", names_prefix = "answer_text_", values_from = "answer_text")
# # A tibble: 2 x 7
#   question_id question_text                answer_text_1 answer_text_2 answer_text_3 answer_text_4 answer_text_5
#         <int> <chr>                        <chr>         <chr>         <chr>         <chr>         <chr>        
# 1       10000 how many people are alive?   20000         50000         60000         900000        <NA>         
# 2       20000 what is the meaning of life? yes           no            maybe         ok            indifference 

Note that this coerces numbers to character in your answer_text_# columns, generally unavoidable.

CodePudding user response:

Using dcast from data.table

library(data.table)
dcast(setDT(dat), question_id   question_text ~ 
   paste0("answer_text_", rowid(question_id)), value.var = "answer_text")

-output

 question_id                question_text answer_text_1 answer_text_2 answer_text_3 answer_text_4 answer_text_5
1:       10000   how many people are alive?         20000         50000         60000        900000          <NA>
2:       20000 what is the meaning of life?           yes            no         maybe            ok  indifference

data

dat <- structure(list(question_id = c(10000L, 10000L, 10000L, 10000L, 
20000L, 20000L, 20000L, 20000L, 20000L), answer_id = c(1L, 2L, 
3L, 4L, 21L, 72L, 83L, 94L, 97L), question_text = c("how many people are alive?", 
"how many people are alive?", "how many people are alive?", "how many people are alive?", 
"what is the meaning of life?", "what is the meaning of life?", 
"what is the meaning of life?", "what is the meaning of life?", 
"what is the meaning of life?"), answer_text = c("20000", "50000", 
"60000", "900000", "yes", "no", "maybe", "ok", "indifference"
)), class = "data.frame", row.names = c(NA, -9L))
  •  Tags:  
  • r
  • Related