Extract characters that repeat in the same position in a vector

Time:10-20

How can I get the characters that repeat at the same position in every string of a vector (here, in each column of a data frame)?

df <- data.frame(col1 = paste0("a", LETTERS[1:5]),
                 col2 = paste0("b", letters[1:5]),
                 col3 = paste0("ccde", letters[1:5]),
                 col4 = paste0('?', letters[1:2], 1:5),
                 col5 = paste0(1:5, 'hello you', letters[1:2], 1:5),
                 col6 = paste0('hello', letters[1:2], 1:5, "you"),
                 col7 = c("hello1 you", "hello you2", "hello3 you", "hello you4", "hello5 you"))

#   col1 col2  col3 col4         col5       col6       col7
# 1   aA   ba ccdea  ?a1 1hello youa1 helloa1you hello1 you
# 2   aB   bb ccdeb  ?b2 2hello youb2 hellob2you hello you2
# 3   aC   bc ccdec  ?a3 3hello youa3 helloa3you hello3 you
# 4   aD   bd ccded  ?b4 4hello youb4 hellob4you hello you4
# 5   aE   be ccdee  ?a5 5hello youa5 helloa5you hello5 you

# Expected result, one string per column
result <- c("a", "b", "ccde", "?", "hello you", "helloyou", "hello")

CodePudding user response:

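Here is one possible approach. Convert the strings to raw vectors, compare the first against each of the others for equality, reduce the comparisons with a logical AND, and use the result to subset the first before converting it back to character. This assumes the strings in each column all have the same length, as they do in your example. You could take the same approach with strsplit(), but I believe that would be slightly less efficient.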

f <- function(x) {
  cv <- charToRaw(x[1])
  rawToChar(cv[Reduce(`&`, lapply(x[-1], \(y) cv == charToRaw(y)))])
}

sapply(df, f)
#  col1        col2        col3        col4        col5        col6        col7 
#   "a"         "b"      "ccde"         "?" "hello you"  "helloyou"     "hello"
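
For comparison, the strsplit() variant mentioned above might look roughly like the sketch below. The name f_split is hypothetical, and the code assumes every string in x has the same number of characters.

```r
# Sketch of a strsplit()-based variant; f_split is a hypothetical name.
# Assumption: all strings in x have the same number of characters.
f_split <- function(x) {
  chars <- strsplit(x, "")   # one character vector per string
  first <- chars[[1]]
  # TRUE at positions where every other string agrees with the first one
  keep <- Reduce(`&`,
                 lapply(chars[-1], function(y) first == y),
                 rep(TRUE, length(first)))
  paste(first[keep], collapse = "")
}

f_split(c("hello1 you", "hello you2", "hello3 you"))
# [1] "hello"
```

Passing an initial value to Reduce() means a single-element vector is simply returned unchanged.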

CodePudding user response:

Update

As per your update on the expected output, it seems we can simplify f a little:

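f <- function(v) {
  l <- min(nchar(v))                # only positions present in every string
  res <- vector("list", l)
  for (i in seq_len(l)) {           # seq_len() also copes with l == 0
    x <- substr(v, i, i)            # i-th character of each string
    if (length(unique(x)) == 1) {
      res[[i]] <- x[1]              # keep it when all strings agree
    }
  }
  paste0(unlist(res), collapse = "")
}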

such that

> sapply(df, f)
       col1        col2        col3        col4        col5        col6 
        "a"         "b"      "ccde"         "?" "hello you"  "helloyou"
       col7
    "hello"

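The same position-by-position check can also be written without an explicit loop. The sketch below is only an illustration: f_vapply is a hypothetical name, and, like f, it compares only the first min(nchar(v)) positions, so strings of unequal length are accepted.

```r
# Loop-free sketch of the same idea; f_vapply is a hypothetical name.
f_vapply <- function(v) {
  idx <- seq_len(min(nchar(v)))     # positions present in every string
  # For each position: do all strings share the same character there?
  same <- vapply(idx,
                 function(i) length(unique(substr(v, i, i))) == 1,
                 logical(1))
  chars <- strsplit(v[1], "")[[1]]  # characters of the first string
  paste(chars[idx[same]], collapse = "")
}

f_vapply(c("abc", "abXYZ"))
# [1] "ab"
```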
Previous Answer

Here is a base R option that uses a custom function f, applied to each column. There are probably packages or built-in functions that do this more efficiently, but I am not aware of them.

f <- function(v) {
  l <- min(nchar(v))
  res <- vector("list", l)
  for (i in seq_len(l)) {
    x <- substr(v, i, i)
    if (length(unique(x)) == 1) {
      res[[i]] <- x[1]
    }
  }
  trimws(
    paste0(
      tapply(
        res,
        # run ids: consecutive positions of the same kind share a group
        cumsum(c(0, diff(lengths(res)) != 0)),
        function(x) {
          u <- unlist(x)
          # a run of non-matching positions collapses to one space
          if (length(u)) paste0(u, collapse = "") else " "
        }
      ),
      collapse = ""
    )
  )
}
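
The grouping expression inside f, cumsum(c(0, diff(lengths(res)) != 0)), may be easier to follow on a toy input. In the sketch below, lens is a made-up stand-in for lengths(res), where 1 marks a position at which all strings agree and 0 marks one where they differ.

```r
# Toy stand-in for lengths(res): 1 = match at that position, 0 = no match
lens <- c(0, 1, 1, 0, 0, 1)

# A new run starts wherever the value changes;
# cumsum() turns the change flags into run ids
run_id <- cumsum(c(0, diff(lens) != 0))
run_id
# [1] 0 1 1 2 2 3

# Positions grouped by run, which is what tapply() does with res in f
split(seq_along(lens), run_id)
```

Each run of matches is pasted together and each run of mismatches becomes a single space, which is why col6 comes out as "hello you" with this version of f.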

and you will see

> sapply(df, f)
       col1        col2        col3        col4        col5        col6
        "a"         "b"      "ccde"         "?" "hello you" "hello you"
       col7
    "hello"