import re
text = """State of California that the foregoing is true and correct. (For California sheriff or marshal use only) 1950-24-12 I certify that the foregoing is true and correct. Date: (SIGNATURE) SUBP-010 [Rev. January 1,2012] PROOF OF SERVICE OF DEPOSITION SUBPOENA FOR PRODUCTION OF BUSINESS RECORDS 055826-00-07 Page 2 of 2"""
pattern = re.findall(r"\d{2,4}[-]\d{1,2}[-]\d{1,2}", text)
print(pattern)
Required output: 1950-24-12
The pattern also matches 5826-00-07, even though it is part of a longer number (055826) that has more than 4 digits. Is there a way to exclude it?
CodePudding user response:
What you want is called a negative lookbehind. It matches a pattern only when the text immediately before the match does not match a given sequence. For example, (?<!something)abc will match any occurrence of "abc" that is not directly preceded by "something".
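As a rough illustration of the (?<!something)abc example, the sketch below lists where "abc" matches in a made-up sample string. The string and the variable names are assumptions chosen only for demonstration.

```python
import re

# Hypothetical sample: the first "abc" follows "something", the other two do not.
sample = "somethingabc abc xabc"

# (?<!something) is a negative lookbehind: it rejects any "abc" that is
# immediately preceded by the literal text "something".
starts = [m.start() for m in re.finditer(r"(?<!something)abc", sample)]
print(starts)  # [13, 18]
```

The lookbehind consumes no characters, so the matched text is still just "abc".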
So in your case, add (?<!\d) to the beginning of your regex so that it only matches when the pattern is not preceded by a digit. Also, [-] matches only the character -, so you don't need the brackets. After these changes, the new regex is (?<!\d)\d{2,4}-\d{1,2}-\d{1,2}.
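A minimal sketch comparing the two patterns on a shortened copy of the text from the question. The shortened string and the variable names original and fixed are assumptions for illustration.

```python
import re

# Shortened version of the text from the question (an assumption for brevity).
text = (
    "(For California sheriff or marshal use only) 1950-24-12 I certify that "
    "the foregoing is true and correct. SUBP-010 [Rev. January 1,2012] "
    "BUSINESS RECORDS 055826-00-07 Page 2 of 2"
)

# Pattern without the lookbehind: it can start matching in the middle of 055826.
original = re.findall(r"\d{2,4}-\d{1,2}-\d{1,2}", text)

# Pattern with the negative lookbehind: a match may not start right after a digit.
fixed = re.findall(r"(?<!\d)\d{2,4}-\d{1,2}-\d{1,2}", text)

print(original)  # ['1950-24-12', '5826-00-07']
print(fixed)     # ['1950-24-12']
```

Raw strings (the r prefix) keep Python from interpreting \d as a string escape before the regex engine sees it.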