How do I remove numeric patterns of a certain length from a string in R

Say I have the string -

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"

I would like to remove numeric patterns that are 7, 8, or 4 characters long, EXCEPT if the pattern is 1000. So essentially I want the following result -

"this is a string with some numbers 1000"

CodePudding user response:

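Use gsub with perl=TRUE and the regex pattern \b(?:\d{7,8}|(?!1000\b)\d{4})\b. It matches any standalone 7- or 8-digit number, or any standalone 4-digit number other than 1000. The negative lookahead (?!1000\b) is what spares 1000, and lookaheads need perl=TRUE in base R: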

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"
output <- gsub("\\b(?:\\d{7,8}|(?!1000\\b)\\d{4})\\b", "", some_string, perl=TRUE)
output

[1] "this is a string with some numbers   1000  "
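To see which tokens the pattern actually picks up, here is a small check. It is only a sketch: the tokens 20022 and 123 are made-up additions that do not appear in the question, included to show that other lengths are left alone.

```r
# Same pattern as above; perl = TRUE is needed for the lookahead.
pattern <- "\\b(?:\\d{7,8}|(?!1000\\b)\\d{4})\\b"

# "20022" and "123" are hypothetical extra tokens, not from the question.
tokens <- c("9639998", "21057535", "1000", "2021", "20022", "123")

# TRUE means the token would be removed by gsub().
matched <- grepl(pattern, tokens, perl = TRUE)
setNames(matched, tokens)
# matched is TRUE TRUE FALSE TRUE FALSE FALSE
```

The 5-digit token is not touched because the trailing \b cannot match between two digits, so \d{4} never succeeds inside a longer number.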

Actually, a better version, which tidies up loose whitespace, would be this:

some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 2022"
output <- gsub("\\s*\\b(?:\\d{7,8}|(?!1000\\b)\\d{4})\\b\\s*", " ", some_string, perl=TRUE)
output <- gsub("^\\s+|\\s+$", "", gsub("\\s{2,}", " ", output))
output

[1] "this is a string with some numbers 1000"
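If you need this on more than one string, the same idea can be wrapped in a function. This is a minimal sketch, not part of the answer above: the name remove_numeric_tokens and the second input string are made up for illustration, and it swaps in base R's trimws() for the trimming step.

```r
# Hypothetical helper: removes standalone 7-, 8- and 4-digit numbers
# (except 1000) and tidies up the leftover whitespace.
remove_numeric_tokens <- function(x) {
  pattern <- "\\b(?:\\d{7,8}|(?!1000\\b)\\d{4})\\b"
  cleaned <- gsub(pattern, "", x, perl = TRUE)
  # Collapse runs of whitespace to one space, then trim both ends.
  trimws(gsub("\\s+", " ", cleaned))
}

# gsub() is vectorised, so a character vector works as input.
# The second string is invented test data.
inputs <- c(
  "this is a string with some numbers 9639998 21057535 1000 2021 2022",
  "ids 20022 1234 10001234"
)
remove_numeric_tokens(inputs)
# [1] "this is a string with some numbers 1000" "ids 20022"
```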

CodePudding user response:

A stringr option that keeps 1000 and any number whose length is not 4, 7, or 8. (I included one number of length 5 in the sample data.)

library(stringr)

"this is a string with some numbers 9639998 21057535 1000 2021 20022 2022" |> 
  # The lookahead sits after \b and ends in \b, so it only spares the exact token 1000
  str_remove_all("\\b(?!1000\\b)(\\d{7,8}|\\d{4})\\b") |>
  str_squish()
#> [1] "this is a string with some numbers 1000 20022"

Created on 2022-05-17 by the reprex package (v2.0.1)
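For comparison, here is a base R sketch that avoids the lookahead entirely by splitting the string into tokens and filtering on length. It is a different technique from the regex answers above, and it assumes the numbers are separated by whitespace only (a token like "2021," with trailing punctuation would not be treated as a number).

```r
some_string <- "this is a string with some numbers 9639998 21057535 1000 2021 20022 2022"

# Split on runs of whitespace.
tokens <- strsplit(some_string, "\\s+")[[1]]

# A token is dropped if it is all digits, has length 4, 7 or 8,
# and is not the exception "1000".
is_number <- grepl("^[0-9]+$", tokens)
drop <- is_number & nchar(tokens) %in% c(4, 7, 8) & tokens != "1000"

result <- paste(tokens[!drop], collapse = " ")
result
# [1] "this is a string with some numbers 1000 20022"
```

Because the tokens are rejoined with single spaces, no separate whitespace clean-up is needed.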