Faster version of strsplit in R

I have data consisting of strings in which runs of letters and runs of digits alternate, e.g. VID22CAS05, TEL21XSE12. I need to count the items each string splits into after parsing, e.g. VID22CAS05 -> VID 22 CAS 05 => length of 4.

data <- c("VID22CAS05", "TEL21XSE12")

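string_lengths <- purrr::map(data, function(x) {
    # put a space after every run of digits or letters, then trim the trailing space
    x_sep <- trimws(x = gsub("(\\d+|[A-Za-z]+)", "\\1 ", x), which = "both")
    # number of space-separated tokens
    length(strsplit(x_sep, " ")[[1]])
})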

This works, but it is very slow on a huge dataset.

Is there any way to speed this up?

CodePudding user response：

Will this do?

lengths(gregexpr('\\d+|[a-zA-Z]+', data))
# [1] 4 4
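
One thing to watch with this approach: gregexpr() reports a string with no match as a single -1, so lengths() would count it as 1 rather than 0. A minimal sketch that guards against this, assuming empty or non-matching strings can occur in your data (the empty string in the sample below is added only for illustration):

```r
data <- c("VID22CAS05", "TEL21XSE12", "")

# gregexpr() returns one integer vector of match start positions per string
matches <- gregexpr("\\d+|[A-Za-z]+", data)

# A string without any match yields -1, so count only positive positions
token_counts <- vapply(matches, function(m) sum(m > 0), integer(1))
token_counts
# [1] 4 4 0
```

If every string is guaranteed to contain at least one letter or digit, the plain lengths() call is enough.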

CodePudding user response：

How about rethinking the problem, given that we know runs of letters and runs of digits alternate? Counting only the runs of digits and doubling the result could speed things up further.

library(stringr)

str_count(data, "\\d+") * 2

Output:

[1] 4 4 6

Data:

data<- c("VID22CAS05", "TEL21XSE12", "TEL21XSE12XSE12")
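
Doubling the digit-run count assumes every string has as many letter runs as digit runs. A base R sketch that drops this assumption, by collapsing each run to a single placeholder character and counting the characters that remain (the placeholders "a" and "0" and the third sample string are chosen here only for illustration):

```r
data <- c("VID22CAS05", "TEL21XSE12XSE12", "AB12C")

# Collapse every run of letters to "a" and every run of digits to "0"
collapsed <- gsub("\\d+", "0", gsub("[A-Za-z]+", "a", data))

# Each remaining character stands for one run, so nchar() gives the run count
run_counts <- nchar(collapsed)
run_counts
# [1] 4 6 3
```

This stays fully vectorized, and it also handles strings such as "AB12C" that end on a letter run, where doubling the digit-run count would give 2 instead of 3.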

CodePudding user response：

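A base R option with gsub and nchar. Note that this counts digits rather than items, so it matches the item count only when every numeric run has exactly two digits, as in the sample data.

nchar(gsub("\\D+", "", data))
# [1] 4 4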
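
To check whether a vectorized version actually pays off on your data, you can time it against the per-element version. A rough sketch using base system.time(), assuming a test vector built by repeating the two sample strings (the names big, slow and fast are made up for this example, and timings will vary by machine):

```r
# Hypothetical test data: the two sample strings repeated many times
big <- rep(c("VID22CAS05", "TEL21XSE12"), times = 10000)

# Per-element approach, as in the question
slow <- function(x) {
  vapply(x, function(s) {
    length(strsplit(trimws(gsub("(\\d+|[A-Za-z]+)", "\\1 ", s)), " ")[[1]])
  }, integer(1), USE.NAMES = FALSE)
}

# Vectorized approach: count regex matches for all strings in one call
fast <- function(x) lengths(gregexpr("\\d+|[A-Za-z]+", x))

system.time(res_slow <- slow(big))
system.time(res_fast <- fast(big))

# Both approaches should agree before the timings are compared
identical(res_slow, res_fast)
```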