I need to extract the date and month from strings such as '14th oct', '14oct', '14.10', '14 10' and '14/10'. For these cases my code below works fine.
import re
query = '14.oct'
print(re.search(r'(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})', query, re.I).groupdict())
Result:
{'date': '14', 'month': 'oct'}
But for the input (1410) it still captures a date and month. I don't want that, because the whole string is a number in another format and should not be treated as a date and month. The result should be None.
How do I change the search pattern for this (using groupdict() only)?
CodePudding user response:
How to change the search pattern for this?
You might try a negative lookbehind assertion for a literal ( combined with a negative lookahead assertion for a literal ), as follows:
import re
query = '14.oct'
noquery = '(1410)'
print(re.search(r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})(?!\))', query, re.I).groupdict())
print(re.search(r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)(?P<month>\d{1,2}|[a-z]{3,9})(?!\))', noquery, re.I))
output
{'date': '14', 'month': 'oct'}
None
Beware that this rejects all bracketed forms, i.e. not only (1410) but also (14 10), (14/10) and so on.
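To see how broad that restriction is, here is a quick check that runs the same pattern over a few inputs. It is only a sketch: the `samples` list and the `results` dict are made up for illustration and are not from the question. Note that a bare 1410 without parentheses still matches with this pattern.

```python
import re

# Pattern from this answer: the lookbehind rejects a preceding "(" and the
# lookahead rejects a following ")".
PATTERN = re.compile(
    r'(?<!\()(?P<date>\b\d{1,2})(?:\b|st|nd|rd|th)?(?:[\s\.\-/_\\,]*)'
    r'(?P<month>\d{1,2}|[a-z]{3,9})(?!\))',
    re.I,
)

# Illustrative inputs (assumed, not taken from the question).
samples = ['14.oct', '(1410)', '(14 10)', '(14/10)', '1410']

results = {}
for s in samples:
    m = PATTERN.search(s)
    # search() returns None when nothing matches, so guard before groupdict().
    results[s] = m.groupdict() if m else None
    print(s, '->', results[s])
```

If the unbracketed 1410 should be rejected as well, the parentheses check alone is not enough and a digit-count check is needed.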
CodePudding user response:
I'm not sure whether you want to exclude 1410 as 4 bare digits or (1410) with the parentheses, but to exclude both you can assert that the date is not the start of 4 consecutive digits:
(?P<date>\b(?!\d{4}\b)\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})
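As a rough sketch of how this pattern behaves on the strings from the question, the snippet below wraps it in a small helper. The name `day_month` is hypothetical and only used for illustration; it returns the groupdict() on a match and None otherwise.

```python
import re

# Same pattern as above; (?!\d{4}\b) refuses to start a match at a run of
# exactly 4 digits that ends at a word boundary.
PATTERN = re.compile(
    r'(?P<date>\b(?!\d{4}\b)\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*'
    r'(?P<month>\d{1,2}|[a-z]{3,9})',
    re.I,
)

def day_month(text):
    """Return {'date': ..., 'month': ...} or None (hypothetical helper)."""
    m = PATTERN.search(text)
    return m.groupdict() if m else None

for s in ['14th oct', '14oct', '14.10', '14 10', '14/10', '1410', '(1410)']:
    print(s, '->', day_month(s))
```

Both 1410 and (1410) give None here, while the separated forms still return the two groups.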
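To not match any date between parentheses, match the parenthesised part first and keep the date and month in the other alternative, so both groups stay None when only the parenthesised part matched: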
\([^()]*\)|(?P<date>\b\d{1,2})(?:st|[nr]d|th)?[\s./_\\,-]*(?P<month>\d{1,2}|[a-z]{3,9})
The pattern in parts:
\([^()]*\)    Match from the opening to the closing parenthesis
|    Or
(?P<date>\b\d{1,2})    Match 1-2 digits
(?:st|[nr]d|th)?    Optionally match st, nd, rd or th
[\s./_\\,-]*    Optionally repeat matching any of the listed characters
(?P<month>\d{1,2}|[a-z]{3,9})    Match 1-2 digits or 3-9 chars a-z
For example
import re
pattern = r"\([^()]*\)|(?P<date>\b\d{1,2})(?:st|[nr]d|th)?(?:[\s./_\\,-]*)(?P<month>\d{1,2}|[a-z]{3,9})"
strings = ["14th oct", "14oct", "14.10", "14 10", "14/10", "1410", "(1410)"]
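for s in strings:
    m = re.search(pattern, s, re.I)
    # m is None when nothing matches; the date group is None when only
    # the parenthesised alternative matched.
    if m and m.group("date"):
        print(m.groupdict())
    else:
        print(f"{s} --> Not valid")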
Output
{'date': '14', 'month': 'oct'}
{'date': '14', 'month': 'oct'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
{'date': '14', 'month': '10'}
(1410) --> Not valid