Separate a column based on symbol while keeping the symbol in the first column

I have a dataset in which I need to split a column into two at the : symbol, while keeping the : in the first column. How can I achieve that?

Here is the dataset:

dd <- data.frame(col1=c("*MOT:0 .",
"*CHI:byebye .",
"*MOT:yeah byebye .",
"*CHI:0 [>] .",
"*MOT:<what are you gonna do now> [<] ?",
"*CHI:gonna do .",
"*MOT:<what's that [= block]> [>] ?"))

dd
                                 col1
                               *MOT:0 .
                          *CHI:byebye .
                     *MOT:yeah byebye .
                           *CHI:0 [>] .
 *MOT:<what are you gonna do now> [<] ?
                        *CHI:gonna do .
     *MOT:<what's that [= block]> [>] ?

In the end, I want this:

  col1   col2
  *MOT:  0 .
  *CHI:  byebye .
  *MOT:  yeah byebye .
  *CHI:  0 [>] .
  *MOT:  <what are you gonna do now> [<] ?
  *CHI:  gonna do .
  *MOT:  <what's that [= block]> [>] ?

Any help will be greatly appreciated!

CodePudding user response：

You can use tidyr::separate with a lookbehind regex as the separator, so the split happens right after the : without consuming it:

tidyr::separate(dd, col1, into = c('col1', 'col2'), sep = '(?<=:)')
#>    col1                              col2
#> 1 *MOT:                               0 .
#> 2 *CHI:                          byebye .
#> 3 *MOT:                     yeah byebye .
#> 4 *CHI:                           0 [>] .
#> 5 *MOT: <what are you gonna do now> [<] ?
#> 6 *CHI:                        gonna do .
#> 7 *MOT:     <what's that [= block]> [>] ?
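
The lookbehind `(?<=:)` matches the empty position immediately after a colon rather than the colon itself, which is why the colon stays in the first column. As a rough illustration in base R, the sketch below drops a visible marker at that position; the vector `x` is a hypothetical sample in the same format as `dd$col1`, not the question's data frame.

```r
# Hypothetical sample in the same format as dd$col1
x <- c("*MOT:0 .", "*CHI:byebye .")

# Insert a visible "|" at the zero-width position just after the first colon.
# perl = TRUE is required for lookbehind syntax in base R regex functions.
marked <- sub("(?<=:)", "|", x, perl = TRUE)
marked
```

The marker lands between the colon and the rest of the line, which is the point where `separate` cuts the string.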

CodePudding user response：

With extract:

tidyr::extract(dd, col1, "(.*\\:)(.*)", into = c("col1", "col2"))

   col1                              col2
1 *MOT:                               0 .
2 *CHI:                          byebye .
3 *MOT:                     yeah byebye .
4 *CHI:                           0 [>] .
5 *MOT: <what are you gonna do now> [<] ?
6 *CHI:                        gonna do .
7 *MOT:     <what's that [= block]> [>] ?

Note that extract is superseded in favor of separate_wider_regex (available in tidyr 1.3.0 and later):

tidyr::separate_wider_regex(dd, col1, c(col1 = ".*\\:", col2 = ".*"))

Or in base R with strcapture:

strcapture("(.*\\:)(.*)", dd$col1, proto = data.frame(col1 = "", col2 = ""))

CodePudding user response：

A base R approach using strsplit with lapply

setNames(data.frame(do.call(rbind, 
  lapply(strsplit(dd$col1, ":"), function(x) 
    c(paste0(x[1], ":"), x[2])))), c("col1", "col2"))
   col1                              col2
1 *MOT:                               0 .
2 *CHI:                          byebye .
3 *MOT:                     yeah byebye .
4 *CHI:                           0 [>] .
5 *MOT: <what are you gonna do now> [<] ?
6 *CHI:                        gonna do .
7 *MOT:     <what's that [= block]> [>] ?
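
One caveat with splitting on every ":" is that any text after a second colon in the utterance would be lost, because only `x[1]` and `x[2]` are kept. The sample data has no such line, so the result above is unaffected. A minimal base R sketch that cuts only at the first colon, assuming every line contains at least one; the vector `x` and the name `out` are hypothetical and only for illustration:

```r
# Hypothetical sample; the second line has an extra colon in the utterance
x <- c("*MOT:0 .", "*CHI:look: a block .")

# Position of the first literal ":" in each string
pos <- regexpr(":", x, fixed = TRUE)

# Everything up to and including the first colon goes to col1, the rest to col2
out <- data.frame(col1 = substr(x, 1, pos),
                  col2 = substr(x, pos + 1, nchar(x)),
                  stringsAsFactors = FALSE)
out
```

Because the cut position is computed once per string, later colons stay inside col2.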

CodePudding user response：

Using stringr::str_split with dplyr:

library(dplyr)
stringr::str_split(dd$col1, "(?<=:)", simplify = TRUE) %>%
  as.data.frame() %>%
  rename(col1 = V1,
         col2 = V2)

   col1                              col2
1 *MOT:                               0 .
2 *CHI:                          byebye .
3 *MOT:                     yeah byebye .
4 *CHI:                           0 [>] .
5 *MOT: <what are you gonna do now> [<] ?
6 *CHI:                        gonna do .
7 *MOT:     <what's that [= block]> [>] ?

CodePudding user response：

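Using base R with read.table. Setting quote = "" stops read.table from treating the apostrophe in what's as a quote character, which would otherwise drop it from the result. This approach assumes the text contains no commas, since the comma is used as the field separator.

read.table(text = sub(":", ":,", dd$col1),
   header = FALSE, sep = ",", quote = "", comment.char = "",
   col.names = c("col1", "col2"))

-output

   col1                              col2
1 *MOT:                               0 .
2 *CHI:                          byebye .
3 *MOT:                     yeah byebye .
4 *CHI:                           0 [>] .
5 *MOT: <what are you gonna do now> [<] ?
6 *CHI:                        gonna do .
7 *MOT:     <what's that [= block]> [>] ?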