Select a substring of characters from the end of a character string in R

Time:10-26

I have a vector of character strings

vec <- c("1ZQOYNBAA55", "2JSNHGKLRBB66", "3HVXCC77", "4LDD88", "5CIFMTLYXEE99")

> vec
[1] "1ZQOYNBAA55"   "2JSNHGKLRBB66" "3HVXCC77"      "4LDD88"        "5CIFMTLYXEE99"

...and I would like to get the last 3 characters from each string. To get the first 3 characters, I can use substr()

substr(vec,1,3)

I would have thought something like substr() with a "fromLast" argument might exist

vec_ends <- substr(vec,1,3, fromLast = TRUE)

With an expected output

> vec_ends
[1] "A55" "B66" "C77" "D88" "E99"

But substr() only counts from the start of the string. In my dataset the string lengths vary, so I can't refer to absolute character positions or fixed string lengths, and there are no consistent separators or delimiting characters for a string split. Does anyone know of an easy way to do this in R?

CodePudding user response:

Here is an approach that doesn't use regex (which often but not always means it's faster).

get_last_n_chars <- function(vec, n = 3) {
    # Start n - 1 characters before the end so that exactly n characters are kept
    substr(vec, nchar(vec) - (n - 1), nchar(vec))
}

get_last_n_chars(vec)
# [1]  "A55" "B66" "C77" "D88" "E99"
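
As a small sketch of how this helper behaves with other inputs: n can be changed, and because base R's substr() treats a start position below 1 as 1, strings shorter than n come back whole. The short inputs below are made up for illustration and are not from the question.

```r
# Same helper as above, repeated so this snippet runs on its own.
get_last_n_chars <- function(vec, n = 3) {
    substr(vec, nchar(vec) - (n - 1), nchar(vec))
}

vec <- c("1ZQOYNBAA55", "2JSNHGKLRBB66", "3HVXCC77", "4LDD88", "5CIFMTLYXEE99")

get_last_n_chars(vec, n = 2)
# [1] "55" "66" "77" "88" "99"

# Hypothetical short inputs: the computed start is below 1, which substr()
# treats as 1, so strings shorter than n are returned unchanged.
get_last_n_chars(c("AB", ""), n = 3)
# [1] "AB" ""
```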

Benchmarking - just for fun

Often (usually?!) performance is irrelevant and you should use whatever code is clearest.

I was curious, though, and in this case avoiding regex does appear to be faster. The really big win, however, is avoiding the sapply(strsplit()) method: I actually had to cut its final point off the plot because it broke the scale.

input_vec  <- c("1ZQOYNBAA55", "2JSNHGKLRBB66", "3HVXCC77", "4LDD88", "5CIFMTLYXEE99")

num_iterations  <- c(10, 1e3, 1e4)
results <- bench::press(
    rows = num_iterations,
    {
        vec  <- rep(input_vec, rows)
        bench::mark(
            min_iterations = 100,
            sub = {
                sub(".*(.{3})$", "\\1", vec)
            },
            gsub = {
                gsub(".*(.{3})$", "\\1", vec)
            },
            strsplit = {
                sapply(strsplit(vec, split=""),  function(x) paste(tail(x, 3), collapse = ""))
            },
            get_last_n_chars_fun = {
                get_last_n_chars(vec)
            },
            stringi = {
                out <- stringi::stri_reverse(vec)
                out <- substr(out,1,3)
                out <- stringi::stri_reverse(out)
                out
            }
        )
    }
)

Plot of results: [image not preserved]

Output of autoplot(results) + theme_bw(): [image not preserved]

CodePudding user response:

You could use a sub() approach:

vec_ends <- sub(".*(.{3})$", "\\1", vec)
vec_ends

[1] "A55" "B66" "C77" "D88" "E99"
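
To spell out how the pattern works: the greedy `.*` consumes as much as it can, `(.{3})$` captures the final three characters, and `\\1` replaces the whole match with that capture. A minimal sketch of generalising this to any n follows; the wrapper name last_n_regex is hypothetical and only for illustration.

```r
vec <- c("1ZQOYNBAA55", "2JSNHGKLRBB66", "3HVXCC77", "4LDD88", "5CIFMTLYXEE99")

# Hypothetical wrapper (illustrative name): build the pattern for a given n.
last_n_regex <- function(x, n = 3) {
    pattern <- sprintf(".*(.{%d})$", as.integer(n))
    # Replace the whole string with the captured last n characters.
    sub(pattern, "\\1", x)
}

last_n_regex(vec)
# [1] "A55" "B66" "C77" "D88" "E99"

last_n_regex(vec, n = 2)
# [1] "55" "66" "77" "88" "99"
```

A string shorter than n does not match the pattern, so sub() leaves it unchanged.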

CodePudding user response:

> gsub(".*(.{3})$", "\\1", vec)
[1] "A55" "B66" "C77" "D88" "E99"

Here's an alternative without using regex:

> sapply(strsplit(vec, split=""),  function(x) paste(tail(x, 3), collapse = ""))
[1] "A55" "B66" "C77" "D88" "E99"
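
One possible variation, offered as a sketch rather than a recommendation, swaps sapply() for vapply(). The result is the same here, but vapply() always returns a character vector, whereas sapply() returns an empty list when the input is empty.

```r
vec <- c("1ZQOYNBAA55", "2JSNHGKLRBB66", "3HVXCC77", "4LDD88", "5CIFMTLYXEE99")

# Split each string into single characters, keep the last 3, and glue them back.
chars <- strsplit(vec, split = "")
vec_ends <- vapply(chars, function(x) paste(tail(x, 3), collapse = ""), character(1))
vec_ends
# [1] "A55" "B66" "C77" "D88" "E99"
```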

CodePudding user response:

I also found the stringi package; applying stri_reverse() twice does the job.

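library(stringi)

out <- stri_reverse(vec)    # reverse each string
out <- substr(out, 1, 3)    # take the first 3 characters of the reversed string
out <- stri_reverse(out)    # reverse back to restore the original order
out
# [1] "A55" "B66" "C77" "D88" "E99"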
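
As a possibly simpler alternative within the same package, stringi's stri_sub() accepts negative positions that count from the end of each string, so the double reversal may not be needed. This is a sketch that assumes stringi is installed; the guard below simply skips the call otherwise.

```r
vec <- c("1ZQOYNBAA55", "2JSNHGKLRBB66", "3HVXCC77", "4LDD88", "5CIFMTLYXEE99")

# Assumes the stringi package is available; skipped if it is not installed.
if (requireNamespace("stringi", quietly = TRUE)) {
    # from = -3 means "start at the third character from the end";
    # the default `to` runs through the last character.
    out <- stringi::stri_sub(vec, from = -3)
    print(out)
}
```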