I have a survey dataset with a (faulty) structure as follows:
library(data.table)
dat <- fread("q1 q2 ...1 ..2 q3 ..1
NA response other else response other
1 4 NA NA 1 NA")
dat
q1 q2 ...1 ..2 q3 ..1
1: NA response other else response other
2: 1 4 <NA> <NA> 1 <NA>
There are two problems with the data structure
- The first row of the data, should actually be part of the column names (data starts at 2nd row)
- The
...1
and..2
(and sometimes more), belong to the question (q
) which comes before.
What I did so far is to paste row 1 to the column names and then remove it:
names(dat) <- paste0(names(dat), dat[1,])
dat<- dat[-1,]
colnames(dat) <- sub("NA.*", "", colnames(dat))
q1 q2response ...1other ..2else q3response ..1other
1: 1 4 <NA> <NA> 1 <NA>
This however does not fix the ...1
and ..2
.
What would be the best way to replace these parts with the correct question?
Desired outcome:
q1 q2_response q2_other q2_else q3_response q3_other
1: 1 4 <NA> <NA> 1 <NA>
CodePudding user response:
Here is my solution
- require packages (stringr, dplyr) (code 1 ~ 2)
- extract your column name array. (code 3)
- extract question title (with regular expression) and save as new column :"Ques" (code 4)
- remove the redundant string in the array and save as new column :"Ans" (code 5-6)
- split each question to different group with function (ifelse & cumsum) (code 7-8)
- use groupby syntax to bind "Ques" and "Ans" in each question group we already handled. (code 10-11)
- assign the new array "NewCol" to your dataframe's column name (code 13)
library(stringr)
library(dplyr)
CoLname <- data.frame(Names = c("q1","q2response", "...1other", "..2else", "q3response", "..1other"))
CoLname %>% mutate(Ques = str_extract(Names , pattern = "q[0-9]{1}"),
Ans = str_replace_all(Names , c("q[0-9]{1}" = "",
"^[\\.] [0-9] " ="")),
QuesGroup = ifelse(is.na(Ques), 0, 1),
QuesGroup = cumsum(QuesGroup)
) %>%
group_by(QuesGroup) %>%
mutate(NewCol = ifelse(Ans == "", Ques ,paste0(first(Ques), "_", Ans ))) %>% ungroup() -> CoLname
colnames(YourDataframe) <- CoLname$NewCol
CodePudding user response:
Here is a possible data.table
approach using zoo::na.locf()
to carry forward the missing question numbers:
library(data.table)
## carry forward question numbers
nms <- zoo::na.locf(replace(names(dat), grepl("\\. ", names(dat)), NA))
## append first row to question numbers
nms <- paste0(nms, ifelse(is.na(dat[1]), "", paste0("_", dat[1])))
setnames(dat, nms)
dat <- dat[-1]
dat
#> q1 q2_response q2_other q2_else q3_response q3_other
#> 1: 1 4 <NA> <NA> 1 <NA>