How to split a long string based on delimiter in R-CodePudding

I have a dataframe with a column having few records with long string which I want to split based on delimeter \n-

for example in below dataframe I want to split the values in a second column based on delimeter from second record onwards by keeping first record as it is

Sr_no    Col1
1        Lorem Ipsum is simply dummied text
2        \n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem 
         Ipsum is simply dummied text

3        \n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem 
         Ipsum is simply dummied text
.        .
.        .
.        .

I wish to get an output like

Sr_no    Col1                                    Col2                   Col3                                   
1        Lorem Ipsum is simply dummied text
2        Lorem Ipsum is simply dummied text      Lorem Ipsum is simply dummied text    Lorem Ipsum is simply dummied text

3        Lorem Ipsum is simply dummied text      Lorem Ipsum is simply dummied text    Lorem Ipsum is simply dummied text
.        .                                       .
.        .                                       .
.        .                                       .

I tried using str_split() and str_split_fixed() in R but I was not able to do. Thanks for any help

CodePudding user response：

Something like this should do it as long as you know the maximum number of parts you expect.

dat <- data.frame(
  Sr_no = 1:3,
  Col1 = c("Lorem Ipsum is simply dummied text",
           "\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text",
           "\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text")
) 


dat %>%
  dplyr::mutate(Col1 = str_remove(Col1, "^\n-\\s*")) %>% # assuming you dont want an empty first column for entries starting with the delim
  tidyr::separate(Col1, into=c("Col1","Col2","Col3"), sep="\\w*,*\n-\\w*")

If you don't know how many colums you could split the column itself like this:

dat$Col1 %>%
  stringr::str_remove("^\n-\\s*") %>%
  stringr::str_split("\\w*,*\n-\\w*", simplify=T) %>%
  data.frame()

CodePudding user response：

If you don't know how many columns you expect in the output, splitstackshape::cSplit() works well.

Since you have a delimeter starting the strings in your example, you will end up with an empty column unless you remove those first, so I did that with stringr::str_replace() which just replaces the first pattern.

library(tidyverse)
library(splitstackshape) 

d <- tibble(Sr_no = 1:3,
            Col1 = c("Lorem Ipsum is simply dummied text", rep("\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text", 2)))

d %>% 
  mutate(Col1 = str_replace(Col1, "\\n- ", "")) %>% 
  cSplit("Col1", "\n-")
#>    Sr_no                              Col1_1
#> 1:     1  Lorem Ipsum is simply dummied text
#> 2:     2 Lorem Ipsum is simply dummied text,
#> 3:     3 Lorem Ipsum is simply dummied text,
#>                                  Col1_2                             Col1_3
#> 1:                                 <NA>                               <NA>
#> 2: Lorem Ipsum is simply dummied text , Lorem Ipsum is simply dummied text
#> 3: Lorem Ipsum is simply dummied text , Lorem Ipsum is simply dummied text

^{Created on 2022-07-17 by the reprex package (v2.0.1)}