regex: match any numeric string not matching exact format

Time:02-28

I'm new to regex and wasn't able to find the answer to my question. I also wasn't able to adapt the answers to related questions to my problem.

I have a vector of several thousand 11-digit numeric strings. Strings are in the format XXXXX1992X0, where the first five digits are arbitrary, the next four digits should indicate a year between 1982 and 2022, the penultimate digit is another arbitrary digit, and the last digit should be either 0 or 1.

I was able to match all strings meeting these specifications, but what I actually want is to match all strings not meeting them (i.e. an invalid four-digit combination where the year should be, or a last digit other than 0 or 1).

So the valid string "33333201430" would not match, but invalid strings "33333201439", "33333999930", and "33333999939" would.

For reasons too complicated to explain here, I need this to be a regex solution, so unfortunately I can't just use !grepl().

MWE:

id <- c("33333201330", "33333201530", "33333201432","33333199834","33333199830", "33333333330","33333333333")
  
grepl("^(?:[0-9]{5})(1982|1983|1984|1985|1986|1987|1988|1989|1990|1991|1992|1993|1994|1995|1996|1997|1998|1999|2000|2001|2002|2003|2004|2005|2006|2007|2008|2009|2010|2011|2012|2013|2014|2015|2016|2017|2018|2019|2020|2021|2022)(?:[0-9]{1})(?:0|1{1})$", id)

Should actually return the opposite: FALSE FALSE TRUE TRUE FALSE TRUE TRUE

CodePudding user response:

You could try:

^(?![0-9]{5}(198[2-9]|199[0-9]|(20([01][0-9]|2[0-2])))[0-9][01]).*$
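
In R, a lookahead pattern like this only works with perl = TRUE, because the default regex engine (TRE) does not support lookaheads. A minimal sketch of applying it to the question's id vector follows. The name invalid_pat is illustrative, the last year alternative is written as 2[0-2] so that 2020, 2021 and 2022 are all treated as valid, and a $ is added inside the lookahead so that only a complete 11-digit match counts as valid:

```r
id <- c("33333201330", "33333201530", "33333201432", "33333199834",
        "33333199830", "33333333330", "33333333333")

# Negative lookahead: the match succeeds only when the string is NOT
# five digits + a year in 1982-2022 + one digit + 0 or 1.
invalid_pat <- "^(?![0-9]{5}(198[2-9]|199[0-9]|20([01][0-9]|2[0-2]))[0-9][01]$).*$"

# perl = TRUE switches grepl() to PCRE, which supports (?! ... ).
res <- grepl(invalid_pat, id, perl = TRUE)
print(res)
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
```

Because of the trailing .*, this version also flags strings that are not 11 digits long at all, which may or may not be what you want.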

CodePudding user response:

How about this?

Explanation:

  • ^, $ beginning and end of string
  • .{5} matches any five characters
  • (.|.|. ...) a group whose alternatives are separated by |, meaning "or"
  • [2-9] one digit from 2 to 9
  • [3-9]{4} four digits, each from 3 to 9
  • 203[0-9].[01] matches 2030, 2031 ... 2039, then . any character, then 0 or 1

grepl(r"{^.{5}(.{5}[2-9]|[3-9]{4}[0-9]{2}|19[0-7][0-9].[01]|19[8][01].[01]|19[0-1].[01]|202[3-9].[01]|203[0-9].[01])$}", id)
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE


I use R's raw string syntax r"{...}" here (available since R 4.0.0), which avoids the need to double backslashes when escaping.
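
The alternation above lists particular invalid ranges rather than every possible one. As a sketch of how the same lookahead-free idea could be extended to cover every four-digit value outside 1982 to 2022, assuming each string is exactly 11 digits (the names bad_year and invalid_pat are illustrative, not part of the answer above):

```r
id <- c("33333201330", "33333201530", "33333201432", "33333199834",
        "33333199830", "33333333330", "33333333333")

# Every four-digit value outside 1982-2022, written as explicit alternatives.
bad_year <- paste(
  "[03-9][0-9]{3}",   # 0000-0999 and 3000-9999
  "1[0-8][0-9]{2}",   # 1000-1899
  "19[0-7][0-9]",     # 1900-1979
  "198[01]",          # 1980-1981
  "202[3-9]",         # 2023-2029
  "20[3-9][0-9]",     # 2030-2099
  "2[1-9][0-9]{2}",   # 2100-2999
  sep = "|"
)

# A string is invalid if the year is bad (followed by any two digits),
# or if the last of the six trailing digits is 2-9.
invalid_pat <- paste0("^[0-9]{5}((", bad_year, ")[0-9]{2}|[0-9]{5}[2-9])$")

res <- grepl(invalid_pat, id)
print(res)
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
```

This needs no lookahead, so it runs with grepl()'s default engine. Unlike a lookahead-based pattern, it does not flag strings that are not exactly 11 digits long.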


Data:

id <- c(33333320130, 33333320150, 33333320132, 33333319983, 33333319980, 
33333333330, 33333319790, 33333320230, 33333319810, 33333319820, 
33333320220, 33333320210)