Suppose I have the following IDs:
74876593476
74877777777
74884784633
74822228765
74878645421
74820201111
I want to drop any number that contains the same digit more than 3 times in a row. The expected result is:
74876593476
74884784633
74878645421
CodePudding user response:
Using the regex from this post, you may use grep:
x <- c(74876593476, 74877777777, 74884784633, 74822228765, 74878645421, 74820201111)
grep('(\\d)\\1\\1\\1', x, invert = TRUE, value = TRUE)
#[1] "74876593476" "74884784633" "74878645421"
Or, if you are a tidyverse fan, you can use str_subset from stringr with the same regex.
stringr::str_subset(x, '(\\d)\\1\\1\\1', negate = TRUE)
#[1] "74876593476" "74884784633" "74878645421"
This removes any number in which the same digit occurs more than 3 times in a row.
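To make the backreference easier to follow, here is a minimal sketch that also extracts the offending run from each ID. It assumes the IDs are stored as character strings (so no numeric-to-character coercion is involved), and the names ids, pattern, has_run and runs are made up for illustration.

```r
# IDs kept as character strings (assumption: avoids numeric-to-character coercion)
ids <- c("74876593476", "74877777777", "74884784633",
         "74822228765", "74878645421", "74820201111")

# (\\d) captures one digit; \\1{3} requires that same digit three more times,
# so the pattern matches a run of at least four identical digits
pattern <- "(\\d)\\1{3}"

# Which IDs contain such a run?
has_run <- grepl(pattern, ids)

# Extract the first matching run from each offending ID
# (regmatches() drops the elements that did not match)
runs <- regmatches(ids, regexpr(pattern, ids))

print(ids[!has_run])
print(runs)
```

Changing the 3 in \\1{3} adjusts the threshold: \\1{k} rejects runs of more than k identical digits.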
CodePudding user response:
We can try grepl to subset x:
> x <- c(74876593476, 74877777777, 74884784633, 74822228765, 74878645421, 74820201111)
> x[!grepl("(\\d)\\1{3}", x)]
[1] 74876593476 74884784633 74878645421
CodePudding user response:
The regex answer is probably better, but here's an alternative using strsplit and rle.
x <- c(74876593476, 74877777777, 74884784633, 74822228765, 74878645421, 74820201111)
x[sapply(strsplit(as.character(x),""),\(x)!any(rle(x)$lengths>3))]
#[1] 74876593476 74884784633 74878645421
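As a rough illustration of what rle contributes here, the sketch below runs it on a single ID taken from the question. The variable names digits and runs are only for illustration.

```r
# Split one ID into its digits (single-character strings)
digits <- strsplit("74822228765", "")[[1]]

# rle() collapses consecutive equal elements into (value, length) pairs
runs <- rle(digits)
print(runs$values)   # the digit that starts each run
print(runs$lengths)  # how many times it repeats consecutively

# The ID is rejected when any run is longer than 3
rejected <- any(runs$lengths > 3)
print(rejected)
```

Here the digit 2 forms a run of length 4, so this ID would be dropped.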
CodePudding user response:
A solution that avoids converting to characters.
fNoRep <- function(x, k = 3L) {
  # number of digits in each value, plus one extra slot per value
  n <- ceiling(log10(x)) + 1L
  # get the digits as integers, plus an extra digit for each value
  i <- as.integer((rep.int(x, n)/10^sequence(n, 0)) %% 10)
  # set the extra digit to 10 in order to separate the values
  i[cs <- cumsum(n)] <- 10L
  # use rle to find runs longer than k
  lens <- rle(i)$lengths
  x[-unique(findInterval(cumsum(lens)[which(lens > k)], cs)) - 1L]
}
x <- c(74876593476, 74877777777, 74884784633, 74822228765, 74878645421, 74820201111, 91526000000)
fNoRep(x)
#> [1] 74876593476 74884784633 74878645421
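Compare this to the grep solution, which fails to remove the value with trailing zeros (91526000000), because the implicit conversion of a number to character can produce scientific notation.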
fNoRepGrep <- function(x, k = 3L) as.numeric(grep(sprintf("(\\d)\\1{%d}", k), x, invert = TRUE, value = TRUE))
fNoRepGrep(x)
#> [1] 74876593476 74884784633 74878645421 91526000000
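The trailing-zeros miss comes from how the number is turned into text before the regex is applied. The sketch below shows one possible workaround, assuming whole-number IDs of this size: convert explicitly with format(..., scientific = FALSE) before matching. The names as_default and as_fixed are illustrative only.

```r
x <- c(74876593476, 91526000000)

# Default coercion, as done implicitly by grep(); it may switch to
# scientific notation when that representation is shorter
as_default <- as.character(x)
print(as_default)

# Assumption: for whole numbers of this size, scientific = FALSE
# keeps every digit in fixed notation
as_fixed <- format(x, scientific = FALSE, trim = TRUE)
print(as_fixed)

# With the fixed representation the run of zeros is visible to the regex
has_run_fixed <- grepl("(\\d)\\1{3}", as_fixed)
print(has_run_fixed)
```

Storing the IDs as character from the start avoids the issue altogether.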
The math-based solution is about twice as fast as the grep solution.
x <- sample(1e10:(1e11 - 1), 1e4)
microbenchmark::microbenchmark(math = fNoRep(x),
grep = fNoRepGrep(x))
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> math 7.5738 9.03255 10.30973 9.38525 11.81905 16.7631 100
#> grep 19.9207 20.19140 20.67160 20.48535 20.94270 23.1786 100
CodePudding user response:
Convert the IDs to strings, use grepl() to flag entries that contain a tripled digit, then use filter() from dplyr to drop the flagged rows. The | in the pattern passed to grepl() lets several alternative substrings be matched. Note that this pattern only lists the triples 111, 222 and 777, so any other repeated digit would have to be added to it by hand.
library(tidyverse)
library(rlang)
data <- tibble(id=as.character(c(74876593476,74877777777,11111,74884784633,74822228765,74878645421,74820201111)))
output <- data %>% mutate(triple=grepl(x=id,pattern="111|222|777")) %>%
filter(triple==FALSE)
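A possible variation on the same dplyr approach, sketched here with the backreference pattern from the earlier answers instead of a hand-written list of triples. This is not the answerer's code; it assumes the IDs are already stored as character strings and uses the question's own values.

```r
library(dplyr)

# IDs from the question, stored as strings (assumption)
data <- tibble(id = c("74876593476", "74877777777", "74884784633",
                      "74822228765", "74878645421", "74820201111"))

# Keep only rows whose id has no digit repeated more than 3 times in a row
output <- data %>%
  filter(!grepl("(\\d)\\1{3}", id))

print(output$id)
```

This covers every digit without enumerating alternatives, and it follows the "more than 3" rule from the question rather than flagging runs of exactly three.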