Home > Mobile >  Repairing column names which show up as ..x, by using the variable name that came before
Repairing column names which show up as ..x, by using the variable name that came before

Time:10-14

I have a survey dataset with a (faulty) structure as follows:

library(data.table)
dat <- fread("q1   q2 ...1  ..2  q3 ..1
       NA   response other  else response other
       1    4        NA   NA   1    NA")

dat
   q1       q2  ...1  ..2       q3   ..1
1: NA response other else response other
2:  1        4  <NA> <NA>        1  <NA>

There are two problems with the data structure

  1. The first row of the data, should actually be part of the column names (data starts at 2nd row)
  2. The ...1 and ..2 (and sometimes more), belong to the question (q) which comes before.

What I did so far is to paste row 1 to the column names and then remove it:

names(dat) <- paste0(names(dat), dat[1,])
dat<- dat[-1,]
colnames(dat) <-  sub("NA.*", "", colnames(dat))

   q1 q2response ...1other ..2else q3response ..1other
1:    1          4      <NA>    <NA>          1     <NA>

This however does not fix the ...1 and ..2.

What would be the best way to replace these parts with the correct question?

Desired outcome:

     q1 q2_response q2_other q2_else q3_response q3_other
1:    1          4      <NA>    <NA>          1     <NA>

CodePudding user response:

Here is my solution

  1. require packages (stringr, dplyr) (code 1 ~ 2)
  2. extract your column name array. (code 3)
  3. extract question title (with regular expression) and save as new column :"Ques" (code 4)
  4. remove the redundant string in the array and save as new column :"Ans" (code 5-6)
  5. split each question to different group with function (ifelse & cumsum) (code 7-8)
  6. use groupby syntax to bind "Ques" and "Ans" in each question group we already handled. (code 10-11)
  7. assign the new array "NewCol" to your dataframe's column name (code 13)
library(stringr)
library(dplyr)
CoLname <- data.frame(Names = c("q1","q2response", "...1other", "..2else", "q3response", "..1other"))
CoLname %>% mutate(Ques = str_extract(Names , pattern = "q[0-9]{1}"),
                   Ans = str_replace_all(Names , c("q[0-9]{1}" = "",
                                                   "^[\\.] [0-9] " ="")),
                   QuesGroup = ifelse(is.na(Ques), 0, 1),
                   QuesGroup = cumsum(QuesGroup)
                   ) %>%
            group_by(QuesGroup) %>%
            mutate(NewCol = ifelse(Ans == "", Ques ,paste0(first(Ques), "_", Ans ))) %>% ungroup() -> CoLname

colnames(YourDataframe) <- CoLname$NewCol

enter image description here

CodePudding user response:

Here is a possible data.table approach using zoo::na.locf() to carry forward the missing question numbers:

library(data.table)

## carry forward question numbers
nms <- zoo::na.locf(replace(names(dat), grepl("\\. ", names(dat)), NA))
## append first row to question numbers 
nms <- paste0(nms, ifelse(is.na(dat[1]), "", paste0("_", dat[1])))

setnames(dat, nms)
dat <- dat[-1]
dat
#>    q1 q2_response q2_other q2_else q3_response q3_other
#> 1:  1           4     <NA>    <NA>           1     <NA>
  •  Tags:  
  • r
  • Related