Improve speed for substring when separating text by number of characters-CodePudding

My data looks like that:

library(data.table)

fake_data <- c(
  "A002JOHN 22M",
  "A003JAMES25M",
  "B234LEE  28M"
)

It's a text file where I have to separate the text in columns according to the number of characters. For example, the first 4 characters are one column, the next 5 characters are another column, etc.

Expected output:

    V1    V2 V3 V4
1 A002 JOHN  22  M
2 A003 JAMES 25  M
3 B234 LEE   28  M

The method I use for now is the following:

method_1 <- function() {
  tmp <- list()
  for (i in seq_along(fake_data)) {
    tmp[[i]] <- substring(fake_data[i], c(1, 5, 10, 12), c(4, 9, 11, 12))
  }
  
  do.call(rbind, tmp) %>% 
    as.data.frame()
}

I thought I could replace do.call and as.data.frame with the following:

method_2 <- function() {
  tmp <- list()
  for (i in seq_along(fake_data)) {
    tmp[[i]] <- substring(fake_data[i], c(1, 5, 10, 12), c(4, 9, 11, 12))
  }
  
  setDT(transpose(tmp))
}

But it makes it slower:

microbenchmark::microbenchmark(
  method_1(),
  method_2(),
  times = 1000
)

> Unit: microseconds
       expr  min    lq     mean median     uq    max neval
 method_1() 50.4  61.4  83.8550   68.8  75.55 4348.9  1000
 method_2() 94.2 108.9 141.1236  115.6 127.30 1211.8  1000

Any idea of how I could improve the speed for this operation?

CodePudding user response：

I don't think my solution is faster than yours in the post, but this might be another option maybe you would be interested

read.table(
  text = gsub(
    "(\\D\\d )(\\D )(\\d )(.*)",
    "\\1 \\2 \\3 \\4", 
    fake_data
  )
)

which gives

    V1    V2 V3 V4
1 A002  JOHN 22  M
2 A003 JAMES 25  M
3 B234   LEE 28  M

Explanation In gsub, we use () to group the patterns we want to put into the data frame by columns.

(\\D\\d ): non-digit character followed by digits
(\\D ): non-digit characters in a row
(\\d ): digit characters in a row
(.*): anything left

CodePudding user response：

You can use tidyr::separate:

library(tidyr)

fake_data <- data.frame(
  V1=c("A002JOHN 22M",
  "A003JAMES25M",
  "B234LEE  28M"
))

fake_data |> 
  separate(V1, into=c("V1", "V2", "V3", "V4"), sep=c(4,9,11))
#>     V1    V2 V3 V4
#> 1 A002 JOHN  22  M
#> 2 A003 JAMES 25  M
#> 3 B234 LEE   28  M

^{Created on 2021-12-14 by the reprex package (v2.0.1)}

If your fields do not have fixed width, you can use regular expression within sep=.