Home > Blockchain >  Split a column in a tibble into multiple columns to clean the data
Split a column in a tibble into multiple columns to clean the data

Time:09-22

I have data in untidy form. One of the columns contains combined information -

  • section (a, b)
  • question no. (1 or 01, 2 or 03, 3 or 03)
  • the question (name, address, ...)

I wish to create a separate column for each.

Below is an example with the desired result.

library(tidyverse)


tbl <- tibble(
  variable = c("a01_name", "a02_address", "a3_phone_number",
                "b1a_total_income", "b01b_no_of_cars")
)

# desired result
desired_tbl <- tibble(
  section = c("a", "a", "a", "b", "b"),
  question_no = c(1, 2, 3, 1, 1),
  sub_question_no = c(NA, NA, NA, "a", "b"),
  question = c("name", "address", "phone_number", "total_income", "no_of_cars")
)


CodePudding user response:

We can use extract

library(tidyr)
library(dplyr)
extract(tbl, variable, into = c("section", "question_no", "sub_question_no", 
   "question"), "^(\\D )(\\d )([a-z] )?_(.*)", convert = TRUE) %>%
  na_if("")

-output

# A tibble: 5 × 4
  section question_no sub_question_no question    
  <chr>         <int> <chr>           <chr>       
1 a                 1 <NA>            name        
2 a                 2 <NA>            address     
3 a                 3 <NA>            phone_number
4 b                 1 a               total_income
5 b                 1 b               no_of_cars  
  • Related