I'm new to regex and couldn't find an answer to my question, and I wasn't able to adapt the answers to related questions to my problem.
I have a vector of several thousand 11-digit numeric strings. Strings are in the format XXXXX1992X0, where the first five digits are random, the next four digits should indicate a year between 1982 and 2022, the penultimate digit is another random digit, and the last digit should be either 0 or 1.
I was able to match all strings meeting these specifications, but what I actually want is to match all strings that do not meet them (i.e. an invalid four-digit combination where the year should be, a last digit other than 0 or 1, or both).
So the valid string "33333201430" would not match, but invalid strings "33333201439", "33333999930", and "33333999939" would.
For reasons too complicated to explain here, I need this to be a regex solution, so unfortunately I can't just use !grepl().
MWE:
id <- c("33333201330", "33333201530", "33333201432", "33333199834", "33333199830", "33333333330", "33333333333")
grepl("^(?:[0-9]{5})(1982|1983|1984|1985|1986|1987|1988|1989|1990|1991|1992|1993|1994|1995|1996|1997|1998|1999|2000|2001|2002|2003|2004|2005|2006|2007|2008|2009|2010|2011|2012|2013|2014|2015|2016|2017|2018|2019|2020|2021|2022)(?:[0-9]{1})(?:0|1{1})$", id)
This should actually return the opposite: FALSE FALSE TRUE TRUE FALSE TRUE TRUE
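As a side note, the long year alternation in the MWE does not have to be typed by hand. The following is a minimal sketch, assuming the same `id` vector as above; the names `years`, `valid_pattern`, and `is_valid` are made up for illustration. It only rebuilds the pattern for valid strings and does not yet invert the match.

```r
id <- c("33333201330", "33333201530", "33333201432", "33333199834",
        "33333199830", "33333333330", "33333333333")

# Build "1982|1983|...|2022" from the integer range (hypothetical helper names).
years <- paste(1982:2022, collapse = "|")

# Five digits, a valid year, any digit, then 0 or 1.
valid_pattern <- paste0("^[0-9]{5}(", years, ")[0-9][01]$")

is_valid <- grepl(valid_pattern, id)
print(is_valid)
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
```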
CodePudding user response:
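You could try a negative lookahead (in R this needs perl = TRUE, since the default regex engine does not support lookarounds):
^(?![0-9]{5}(198[2-9]|199[0-9]|(20([01][0-9]|2[0-2])))[0-9][01]).*$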
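A minimal sketch of how a lookahead pattern like this could be applied to the question's MWE. The variable names `invalid_pattern` and `result` are made up for illustration, and the pattern here is a slightly tightened variant of the one above: it uses `2[0-2]` so that 2021 counts as a valid year, and it puts `$` inside the lookahead so that the whole string must have the valid layout.

```r
id <- c("33333201330", "33333201530", "33333201432", "33333199834",
        "33333199830", "33333333330", "33333333333")

# Negative lookahead: match only strings that do NOT have the valid layout.
# Year alternatives: 1982-1989, 1990-1999, 2000-2019, 2020-2022.
invalid_pattern <- "^(?![0-9]{5}(198[2-9]|199[0-9]|20([01][0-9]|2[0-2]))[0-9][01]$).*$"

# Lookaheads need the PCRE engine, hence perl = TRUE.
result <- grepl(invalid_pattern, id, perl = TRUE)
print(result)
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
```

This gives the inverted result asked for in the question without wrapping the call in `!`.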
CodePudding user response:
How about this?
Explanation:
^, $: beginning and end of the string
.{5}: match any character five times
(...|...|...): a group of alternatives, with | denoting "or"
[2-9]: one digit from 2 to 9
[3-9]{4}: four digits, each from 3 to 9
203[0-9].[01]: matches 2030, 2031, ..., 2039, then . (any character), then 0 or 1
grepl(r"{^.{5}(.{5}[2-9]|[3-9]{4}[0-9]{2}|19[0-7][0-9].[01]|19[8][01].[01]|19[0-1].[01]|202[3-9].[01]|203[0-9].[01])$}", id)
# [1] FALSE FALSE TRUE TRUE FALSE TRUE TRUE
I use the raw string format r"{...}" here (available since R 4.0.0), which avoids the need for double backslashes when escaping.
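To illustrate what the raw string syntax saves, here is a small sketch, assuming R 4.0.0 or later. The pattern and the names `raw_pat`, `escaped_pat`, and `same` are made up for this example and are not part of the solution above.

```r
# The same five-digit pattern written twice: once as a raw string,
# once as an ordinary string with doubled backslashes.
raw_pat     <- r"{^\d{5}$}"
escaped_pat <- "^\\d{5}$"

# Both literals produce the identical character string.
same <- identical(raw_pat, escaped_pat)
print(same)
# [1] TRUE

# Either form can be passed to grepl().
print(grepl(raw_pat, c("12345", "1234a")))
# [1]  TRUE FALSE
```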
Data:
id <- c(33333320130, 33333320150, 33333320132, 33333319983, 33333319980,
33333333330, 33333319790, 33333320230, 33333319810, 33333319820,
33333320220, 33333320210)