Home > Software design >  Define separator in "separate" function or use alternative
Define separator in "separate" function or use alternative

Time:12-10

I`m trying to transform a column into multiple columns.

This is my example:

df <- data.frame(Id = c(1,2,3), 
                 Col2 = c("['aaa' 'aaa aaaa' 'aaa.bb']","['aaa' 'aaa aaa' 'aaa bbb ccc' 'aaa'\n]","[]"))    
df


Id   Col2

1   ['aaa' 'aaa aaaa' 'aaa.bb']         
2   ['aaa' 'aaa aaa' 'aaa bbb ccc' 'aaa'\n]         
3   []

In my real case i can have 20 strings in each observation.

This would be my expected result:

df2 <- data.frame(Id =c(1,2,3),
             Col1 = c("aaa","aaa",NA),
             Col2 = c("aaa.aaaa","aaa.aaa",NA),
             Col3 = c("aaa.bb","aaa bbb ccc",NA),
             Col4 = c(NA,"aaa",NA))
df2


Id  Col1    Col2     Col3        Col4

1   aaa   aaa.aaaa  aaa.bb        NA
2   aaa   aaa.aaa   aaa bbb ccc  aaa
3   NA      NA       NA           NA

How can i separate by ""?

I tried to use separate function, but i can't seem to find the right sintax for "sep". "separate" seemed the best alternative for me, because I was using dplyr to maintain some previous columns.

Also, where can i find information about the using of sep. I haved saw some examples here, but I can't understand the rationale for using the characters

I also tried str_split, but i'm having trouble turning it back into a data frame.

Thanks in advance

CodePudding user response:

Add an explicit separator so that it is easy to get data in different columns. Here, I have added comma as a separator after every word group in single quotes ('....').

After that, perform some data cleaning using trimws and gsub, get data in long format using separate_rows splitting on comma, add an id column with data.table::rowid and get data in wide format in different columns using pivot_wider.

library(dplyr)
library(tidyr)

df %>%
  mutate(Col2 = gsub("('.*?')", "\\1,", Col2), 
         Col2 = trimws(Col2, whitespace = '\\[|\\]'), 
         Col2 = trimws(Col2, 'right', '[\n,]'), 
         Col2 = gsub("'", '', Col2)) %>%
  separate_rows(Col2, sep = ',\\s*') %>%
  mutate(row = data.table::rowid(Id)) %>%
  pivot_wider(names_from = row, values_from = Col2, names_prefix = 'col')

#     Id col1  col2     col3        col4 
#  <dbl> <chr> <chr>    <chr>       <chr>
#1     1 "aaa" aaa aaaa aaa.bb      NA   
#2     2 "aaa" aaa aaa  aaa bbb ccc aaa  
#3     3 ""    NA       NA          NA   

CodePudding user response:

You've essentially got quoted strings 'text text' wrapped in []. So, strip out the []'s and read it like you would a normal text file with quoted strings:

cbind(
  df["Id"],
  read.table(text=gsub("[][]|\n", "", df$Col2), sep=" ", quote="'",
             header=FALSE, fill=TRUE, blank.lines.skip=FALSE, na.strings="")
)
#  Id   V1       V2          V3   V4
#1  1  aaa aaa aaaa      aaa.bb <NA>
#2  2  aaa  aaa aaa aaa bbb ccc  aaa
#3  3 <NA>     <NA>        <NA> <NA>   
  • Related