I`m trying to transform a column into multiple columns.
This is my example:
df <- data.frame(Id = c(1,2,3),
Col2 = c("['aaa' 'aaa aaaa' 'aaa.bb']","['aaa' 'aaa aaa' 'aaa bbb ccc' 'aaa'\n]","[]"))
df
Id Col2
1 ['aaa' 'aaa aaaa' 'aaa.bb']
2 ['aaa' 'aaa aaa' 'aaa bbb ccc' 'aaa'\n]
3 []
In my real case i can have 20 strings in each observation.
This would be my expected result:
df2 <- data.frame(Id =c(1,2,3),
Col1 = c("aaa","aaa",NA),
Col2 = c("aaa.aaaa","aaa.aaa",NA),
Col3 = c("aaa.bb","aaa bbb ccc",NA),
Col4 = c(NA,"aaa",NA))
df2
Id Col1 Col2 Col3 Col4
1 aaa aaa.aaaa aaa.bb NA
2 aaa aaa.aaa aaa bbb ccc aaa
3 NA NA NA NA
How can i separate by ""?
I tried to use separate function, but i can't seem to find the right sintax for "sep". "separate" seemed the best alternative for me, because I was using dplyr to maintain some previous columns.
Also, where can i find information about the using of sep. I haved saw some examples here, but I can't understand the rationale for using the characters
I also tried str_split, but i'm having trouble turning it back into a data frame.
Thanks in advance
CodePudding user response:
Add an explicit separator so that it is easy to get data in different columns. Here, I have added comma as a separator after every word group in single quotes ('....'
).
After that, perform some data cleaning using trimws
and gsub
, get data in long format using separate_rows
splitting on comma, add an id column with data.table::rowid
and get data in wide format in different columns using pivot_wider
.
library(dplyr)
library(tidyr)
df %>%
mutate(Col2 = gsub("('.*?')", "\\1,", Col2),
Col2 = trimws(Col2, whitespace = '\\[|\\]'),
Col2 = trimws(Col2, 'right', '[\n,]'),
Col2 = gsub("'", '', Col2)) %>%
separate_rows(Col2, sep = ',\\s*') %>%
mutate(row = data.table::rowid(Id)) %>%
pivot_wider(names_from = row, values_from = Col2, names_prefix = 'col')
# Id col1 col2 col3 col4
# <dbl> <chr> <chr> <chr> <chr>
#1 1 "aaa" aaa aaaa aaa.bb NA
#2 2 "aaa" aaa aaa aaa bbb ccc aaa
#3 3 "" NA NA NA
CodePudding user response:
You've essentially got quoted strings 'text text'
wrapped in []
.
So, strip out the []
's and read it like you would a normal text file with quoted strings:
cbind(
df["Id"],
read.table(text=gsub("[][]|\n", "", df$Col2), sep=" ", quote="'",
header=FALSE, fill=TRUE, blank.lines.skip=FALSE, na.strings="")
)
# Id V1 V2 V3 V4
#1 1 aaa aaa aaaa aaa.bb <NA>
#2 2 aaa aaa aaa aaa bbb ccc aaa
#3 3 <NA> <NA> <NA> <NA>