Home > database >  How to split a long string based on delimiter in R
How to split a long string based on delimiter in R

Time:07-18

I have a dataframe with a column having few records with long string which I want to split based on delimeter \n-

for example in below dataframe I want to split the values in a second column based on delimeter from second record onwards by keeping first record as it is

Sr_no    Col1
1        Lorem Ipsum is simply dummied text
2        \n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem 
         Ipsum is simply dummied text

3        \n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem 
         Ipsum is simply dummied text
.        .
.        .
.        .


I wish to get an output like

Sr_no    Col1                                    Col2                   Col3                                   
1        Lorem Ipsum is simply dummied text
2        Lorem Ipsum is simply dummied text      Lorem Ipsum is simply dummied text    Lorem Ipsum is simply dummied text

3        Lorem Ipsum is simply dummied text      Lorem Ipsum is simply dummied text    Lorem Ipsum is simply dummied text
.        .                                       .
.        .                                       .
.        .                                       .

I tried using str_split() and str_split_fixed() in R but I was not able to do. Thanks for any help

CodePudding user response:

Something like this should do it as long as you know the maximum number of parts you expect.

dat <- data.frame(
  Sr_no = 1:3,
  Col1 = c("Lorem Ipsum is simply dummied text",
           "\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text",
           "\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text")
) 


dat %>%
  dplyr::mutate(Col1 = str_remove(Col1, "^\n-\\s*")) %>% # assuming you dont want an empty first column for entries starting with the delim
  tidyr::separate(Col1, into=c("Col1","Col2","Col3"), sep="\\w*,*\n-\\w*")

If you don't know how many colums you could split the column itself like this:

dat$Col1 %>%
  stringr::str_remove("^\n-\\s*") %>%
  stringr::str_split("\\w*,*\n-\\w*", simplify=T) %>%
  data.frame() 

CodePudding user response:

If you don't know how many columns you expect in the output, splitstackshape::cSplit() works well.

Since you have a delimeter starting the strings in your example, you will end up with an empty column unless you remove those first, so I did that with stringr::str_replace() which just replaces the first pattern.

library(tidyverse)
library(splitstackshape) 

d <- tibble(Sr_no = 1:3,
            Col1 = c("Lorem Ipsum is simply dummied text", rep("\n- Lorem Ipsum is simply dummied text,\n- Lorem Ipsum is simply dummied text ,\n- Lorem Ipsum is simply dummied text", 2)))

d %>% 
  mutate(Col1 = str_replace(Col1, "\\n- ", "")) %>% 
  cSplit("Col1", "\n-")
#>    Sr_no                              Col1_1
#> 1:     1  Lorem Ipsum is simply dummied text
#> 2:     2 Lorem Ipsum is simply dummied text,
#> 3:     3 Lorem Ipsum is simply dummied text,
#>                                  Col1_2                             Col1_3
#> 1:                                 <NA>                               <NA>
#> 2: Lorem Ipsum is simply dummied text , Lorem Ipsum is simply dummied text
#> 3: Lorem Ipsum is simply dummied text , Lorem Ipsum is simply dummied text

Created on 2022-07-17 by the reprex package (v2.0.1)

  •  Tags:  
  • r
  • Related