Extract characters between semicolons in r-CodePudding

Trying to extract data between semicolons and put that data into new columns.

Here is some data

df <- data.frame(data = c("a;;c;d", "a;b;;d","a;;;d","a;b;;;"), num =c(1:4))

Here is what I have scraped together so far from S.O.

res <- df %>% 
  mutate(
    colA = str_extract(data, "^[^;]*(?=;)"),
    colB = str_extract(data, "(?<=;)[^;]*(?=;)"),
    colC = str_extract(data, "(?<=;)(?<=;)[^;]*(?=;)"),
    colD = str_extract(data, "(?<=;)[^;]*$")
  )

It nearly does what I want but colC is the same as colB. I dont really understand regex so a solution and a explanation would be gratefully received.

CodePudding user response：

base R

cbind(df, read.csv2(text = df$data, header = FALSE))
#     data num V1 V2 V3 V4 V5
# 1 a;;c;d   1  a     c  d NA
# 2 a;b;;d   2  a  b     d NA
# 3  a;;;d   3  a        d NA
# 4 a;b;;;   4  a  b       NA

dplyr

library(dplyr)
df %>%
  mutate(read.csv2(text = data, header = FALSE))
#     data num V1 V2 V3 V4 V5
# 1 a;;c;d   1  a     c  d NA
# 2 a;b;;d   2  a  b     d NA
# 3  a;;;d   3  a        d NA
# 4 a;b;;;   4  a  b       NA

This works without explicit assignment because mutate (and summarize) will happily take a named-list (of which data.frame is a special -- and compatible -- case).

CodePudding user response：

Additionally to r2evans base R and dplyr:

data.table

library(data.table)

df <- data.table(data = c("a;;c;d", "a;b;;d","a;;;d","a;b;;;"), num =c(1:4))

df[, c("ColA", "ColB", "ColC", "ColD"):=tstrsplit(data, ";")]

df

     data num ColA ColB ColC ColD
1: a;;c;d   1    a         c    d
2: a;b;;d   2    a    b         d
3:  a;;;d   3    a              d
4: a;b;;;   4    a    b

CodePudding user response：

Another solution, using tidyr::separate:

library(tidyverse)

df <- data.frame(data = c("a;;c;d", "a;b;;d","a;;;d","a;b;;;"), num =c(1:4))

df %>% 
  separate(data, into = str_c("col", letters[1:4]), sep=";", extra="drop")

#>   cola colb colc cold num
#> 1    a         c    d   1
#> 2    a    b         d   2
#> 3    a              d   3
#> 4    a    b             4