I dont (need to) care about performance!
My regex matches the date format dd.mm.yyyy
((([0][1-9]|[12][\d])|[3][01])[./]([0][13578]|[1][02])[./][1-9]\d\d\d)|((([0][1-9]|[12][\d])|[3][0])[./]([0][13456789]|[1][012])[./][1-9]\d\d\d)|(([0][1-9]|[12][\d])[-/][0][2][./][1-9]\d([02468][048]|[13579][26]))|(([0][1-9]|[12][0-8])[./][0][2][./][1-9]\d\d\d)
Here are the dates my regex does not match yet. Any help appreciated.
09. Juni 1997
01.Aug.1995
27.06. 1997
29.02.1996
21. 01. 1999
28.05. 1996
07..4..1995
20:03:1998
9.4.1997
14 .03 - 1995
I started out by trying to add the month letters but failed (probably because of the whitespaces between them)
here is a regex that validates the months' letter order (Januar, Februar, März, April, Mai, Juni, August, September, Oktober, November, Dezember)
(?:J(anuar|u(n|li))|Februar|Mä(rz|i)|A(pril|ugust)|(((Sept|Nov|Dez)em)|Okto)ber)
I found this on the internet, which focusses on the issue if only 3 letters of the months are availiable
(((([1-9])|([0][1-9])|([1-2][0-9])|(30))\-([A,a][P,p][R,r]|[J,j][U,u][N,n]|[S,s][E,e][P,p]|[N,n][O,o][V,v]))|((([1-9])|([0][1-9])|([1-2][0-9])|([3][0-1]))\-([J,j][A,a][N,n]|[M,m][A,a][R,r]|[M,m][A,a][Y,y]|[J,j][U,u][L,l]|[A,a][U,u][G,g]|[O,o][C,c][T,t]|[D,d][E,e][C,c])))\-[0-9]{4}$)|(^(([1-9])|([0][1-9])|([1][0-9])|([2][0-8]))\-([F,f][E,e][B,b])\-[0-9]{2}(([02468][1235679])|([13579][01345789]))$)|(^(([1-9])|([0][1-9])|([1][0-9])|([2][0-9]))\-([F,f][E,e][B,b])\-[0-9]{2}(([02468][048])|([13579][26]))
CodePudding user response:
You can use
pattern = r"""(?x)(?<!d)(?:
(?:(?:0?[1-9]|[12]\d)|3[01])\s?[./:-][\s.]?(?:0?[13578]|1[02]|J(?:an(?:uar)?|uli?)|M(?:ärz?|ai)|Aug(?:ust)?|Dez(?:ember)?|Okt(?:ober)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d|
(?:(?:0?[1-9]|[12]\d)|30)\s?[./:-][\s.]?(?:0?[13-9]|1[012]|J(?:an(?:uar)?|u[nl]i?)|M(?:ärz?|ai)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|(?:Nov|Dez)(?:ember)?|Okt(?:ober)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d|
(?:0?[1-9]|[12]\d)\s?[./:-][\s.]?(?:0?2|Fe(?:b(?:ruar)?)?)\s?(?:[./:-][\s.]?)?[1-9]\d(?:[02468][048]|[13579][26])|
(?:0?[1-9]|[12][0-8])\s?[./:-][\s.]?(?:0?2|Fe(?:b(?:ruar)?)?)\s?(?:[./:-][\s.]?)?[1-9]\d\d\d
)(?!\d)"""
See the regex demo.
Main POIs:
- The month regex is
(?:J(?:an(?:uar)?|u[nl]i?)|Fe(?:b(?:ruar)?)?|M(?:ärz?|ai)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|(?:Nov|Dez)(?:ember)?|Okt(?:ober)?)
and it is tested here. Adjust for shortenings as you see fit. Febraury
pattern is used separately for the last two alternations (they are specifically for Februrary) and is subtracted from the month pattern for the rest of the alternatives- From the first alternation, for 31-day months, February, April, June, September and November months are removed
- Leading zeros in days and months is made optional by adding
?
quantifier after0
- The separator between days and months is changed to
\s?[./:-][\s.]?
: an optional whitespace, a char from./:-
char set, and then an optional whitespace or.
- The separator between months and years is changed to
\s?(?:[./:-][\s.]?)?
: an optional whitespace and then an optional sequence of a char from./:-
char set and then an optional whitespace or.
.
The numeric boundaries, (?<!\d)
/ (?!\d)
, are added on both ends to make sure there are no other digits on both ends of the match.