I have a dataframe with a column having few records with long string which I want to split based on delimeter \n-
for example in below dataframe I want to split the values in a second column based on delimeter from second record onwards by keeping first record as it is
Sr_no Col1
1 Lorem Ipsum is simply dummied text
2 \n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem
Ipsum is simply dummied text
3 \n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem
Ipsum is simply dummied text
. .
. .
. .
I wish to get an output like
Sr_no Col1 Col2 Col3
1 Lorem Ipsum is simply dummied text
2 Lorem Ipsum is simply dummied text Lorem Ipsum is simply dummied text Lorem Ipsum is simply dummied text
3 Lorem Ipsum is simply dummied text Lorem Ipsum is simply dummied text Lorem Ipsum is simply dummied text
. . .
. . .
. . .
I tried using str_split()
and str_split_fixed()
in R but I was not able to do. Thanks for any help
CodePudding user response:
Something like this should do it as long as you know the maximum number of parts you expect.
dat <- data.frame(
Sr_no = 1:3,
Col1 = c("Lorem Ipsum is simply dummied text",
"\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text",
"\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text")
)
dat %>%
dplyr::mutate(Col1 = str_remove(Col1, "^\n-\\s*")) %>% # assuming you dont want an empty first column for entries starting with the delim
tidyr::separate(Col1, into=c("Col1","Col2","Col3"), sep="\\w*,*\n-\\w*")
If you don't know how many colums you could split the column itself like this:
dat$Col1 %>%
stringr::str_remove("^\n-\\s*") %>%
stringr::str_split("\\w*,*\n-\\w*", simplify=T) %>%
data.frame()
CodePudding user response:
If you don't know how many columns you expect in the output, splitstackshape::cSplit()
works well.
Since you have a delimeter starting the strings in your example, you will end up with an empty column unless you remove those first, so I did that with stringr::str_replace()
which just replace
s the first pattern
.
library(tidyverse)
library(splitstackshape)
d <- tibble(Sr_no = 1:3,
Col1 = c("Lorem Ipsum is simply dummied text", rep("\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text", 2)))
d %>%
mutate(Col1 = str_replace(Col1, "\\n- ", "")) %>%
cSplit("Col1", "\n-")
#> Sr_no Col1_1
#> 1: 1 Lorem Ipsum is simply dummied text
#> 2: 2 Lorem Ipsum is simply dummied text,
#> 3: 3 Lorem Ipsum is simply dummied text,
#> Col1_2 Col1_3
#> 1: <NA> <NA>
#> 2: Lorem Ipsum is simply dummied text , Lorem Ipsum is simply dummied text
#> 3: Lorem Ipsum is simply dummied text , Lorem Ipsum is simply dummied text
Created on 2022-07-17 by the reprex package (v2.0.1)