How can I get the characters that repeat at the same position in every string of a column?
df <- data.frame(col1 = paste0("a", LETTERS[1:5]),
                 col2 = paste0("b", letters[1:5]),
                 col3 = paste0("ccde", letters[1:5]),
                 col4 = paste0('?', letters[1:2], 1:5),
                 col5 = paste0(1:5, 'hello you', letters[1:2], 1:5),
                 col6 = paste0('hello', letters[1:2], 1:5, "you"),
                 col7 = c("hello1 you", "hello you2", "hello3 you", "hello you4", "hello5 you"))
#   col1 col2  col3 col4         col5       col6       col7
# 1   aA   ba ccdea  ?a1 1hello youa1 helloa1you hello1 you
# 2   aB   bb ccdeb  ?b2 2hello youb2 hellob2you hello you2
# 3   aC   bc ccdec  ?a3 3hello youa3 helloa3you hello3 you
# 4   aD   bd ccded  ?b4 4hello youb4 hellob4you hello you4
# 5   aE   be ccdee  ?a5 5hello youa5 helloa5you hello5 you
Expected output:
result <- c("a", "b", "ccde", "?", "hello you", "helloyou", "hello")
CodePudding user response:
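Here is one possible approach. Convert each string to raw bytes, compare the first string against each of the others position by position, combine the comparisons with Reduce(), and use the resulting logical vector to subset the first string before converting it back to character. Note that this compares bytes, so it assumes the strings in a column have the same length and contain no multibyte characters. You could take the same approach with strsplit(), but I believe that would be slightly less efficient.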
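f <- function(x) {
  # Bytes of the first string, used as the reference
  cv <- charToRaw(x[1])
  # TRUE where every other string has the same byte at that position
  same <- Reduce(`&`, lapply(x[-1], \(y) cv == charToRaw(y)))
  rawToChar(cv[same])
}
sapply(df, f)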
#        col1        col2        col3        col4        col5        col6        col7
#         "a"         "b"      "ccde"         "?" "hello you"  "helloyou"     "hello"
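The strsplit() alternative mentioned above could look roughly like the sketch below. This is not the answerer's code: the name f_split is made up for illustration, and truncating to the shortest string is an added assumption about how unequal lengths should be handled. Because strsplit(x, "") splits into characters rather than bytes, this variant should also cope better with multibyte text.

```r
# Hypothetical helper (f_split is an illustrative name, not from the answer).
f_split <- function(x) {
  chars <- strsplit(x, "")            # one character vector per string
  n <- min(lengths(chars))            # assumption: compare up to the shortest string
  first <- chars[[1]][seq_len(n)]
  # Start from all TRUE and keep TRUE only where every other string agrees
  same <- Reduce(`&`,
                 lapply(chars[-1], function(y) first == y[seq_len(n)]),
                 rep(TRUE, n))
  paste0(first[same], collapse = "")
}

f_split(c("hello1 you", "hello you2", "hello3 you"))
# [1] "hello"
```

It can be applied per column in the same way, for example with sapply(df, f_split).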
CodePudding user response:
Update
As per your update on the expected output, it seems we can simplify f a little bit:
f <- function(v) {
  l <- min(nchar(v))
  res <- vector("list", l)
  # seq_len() avoids iterating over c(1, 0) when the shortest string is empty
  for (i in seq_len(l)) {
    x <- substr(v, i, i)
    if (length(unique(x)) == 1) {
      res[[i]] <- x[1]
    }
  }
  paste0(unlist(res), collapse = "")
}
such that
> sapply(df, f)
       col1        col2        col3        col4        col5        col6
        "a"         "b"      "ccde"         "?" "hello you"  "helloyou"
       col7
    "hello"
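For comparison, the same position-by-position check can be written without an explicit loop by laying the characters out in a matrix. This is only a sketch under the same rule as above (compare up to the shortest string); the name f_mat is hypothetical and does not come from the answer.

```r
# Hypothetical loop-free variant (f_mat is an illustrative name).
f_mat <- function(v) {
  l <- min(nchar(v))                  # compare only up to the shortest string
  if (l == 0) return("")
  pos <- seq_len(l)
  # One row per position, one column per string
  m <- vapply(v, function(s) substring(s, pos, pos), character(l))
  m <- matrix(m, nrow = l)            # keep the matrix shape even when l == 1
  # A position is kept when all strings share the same character there
  keep <- apply(m, 1, function(r) length(unique(r)) == 1)
  paste0(m[keep, 1], collapse = "")
}

f_mat(c("abc1", "abd"))
# [1] "ab"
```

The unequal-length input shows the effect of min(nchar(v)): the trailing "1" is never compared.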
Previous Answer
Here is a base R option using a custom function f to iterate through the columns (there are probably packages or functions that do this more efficiently, but I am not aware of them):
f <- function(v) {
  l <- min(nchar(v))
  res <- vector("list", l)
  for (i in seq_len(l)) {
    x <- substr(v, i, i)
    if (length(unique(x)) == 1) {
      res[[i]] <- x[1]
    }
  }
  # Group consecutive matched / unmatched positions, then replace each
  # unmatched run with a single space
  trimws(
    paste0(
      tapply(
        res,
        cumsum(c(0, diff(lengths(res)) != 0)),
        function(x) {
          u <- unlist(x)
          if (length(u)) paste0(u, collapse = "") else " "
        }
      ),
      collapse = ""
    )
  )
}
and you will see
> sapply(df, f)
       col1        col2        col3        col4        col5        col6
        "a"         "b"      "ccde"         "?" "hello you" "hello you"
       col7
    "hello"
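The grouping step inside tapply() is the least obvious part of this version, so here it is in isolation. The vectors hit and chars are made-up illustration data (standing in for lengths(res) and the matched characters), not values from the question.

```r
# Made-up example: 1 = position matched in every string, 0 = it did not
hit <- c(1, 1, 1, 0, 0, 1, 1)

# A new group id starts wherever the value changes from the previous one
grp <- cumsum(c(0, diff(hit) != 0))
grp
# [1] 0 0 0 1 1 2 2

# Matched characters, with NA standing in for unmatched positions
chars <- c("a", "b", "c", NA, NA, "d", "e")

# Each run of matches is pasted together; each run of misses becomes one space
pieces <- tapply(chars, grp, function(u) {
  if (anyNA(u)) " " else paste0(u, collapse = "")
})
paste0(pieces, collapse = "")
# [1] "abc de"
```

This is why col6 came out as "hello you" in this version: the run of differing characters between "hello" and "you" collapses to a single space instead of being dropped.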