I'm trying to extract the dates from a string into a list using re.
Here is what I've written:
import re
dateStr = "20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009"
regex = r'(?:\d{1,2}[/-]*)?(?:Mar)?[a-z\s,.]*(?:\d{1,2}[/-]*)+(?:\d{2,4})+'
result = re.findall(regex, dateStr)
Even though I put (?:\d{1,2}[/-]*) at the beginning of the expression, the day digits are missing from the matches. Here is what I get:
['Mar 2009', 'March 2009', 'Mar. 2009', 'March, 2009']
Could you help? Thanks
Edit:
This question was solved through the comments.
Original assignment string:
dateStr = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
Answer:
I would use:
import re
dateStr = "20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009"
dt = re.findall(r'\d{1,2} \w+[,.]? \d{4}', dateStr)
print(dt) # ['20 Mar 2009', '20 March 2009', '20 Mar. 2009', '20 March, 2009']
The one-size-fits-all regex pattern used above says to match:
\d{1,2} a one- or two-digit day
[ ] space
\w+ month name or abbreviation (one or more word characters)
[,.]? optionally followed by a comma or period
[ ] space
\d{4} four-digit year
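For readers who want the breakdown next to the code, here is a minimal sketch of the same pattern written with re.VERBOSE. The named groups (day, month, year) and the variable names DATE_RE, full_matches and parts are illustrative additions, not part of the answer.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so that each
# piece carries its own comment. The group names are illustrative only.
DATE_RE = re.compile(
    r"""
    (?P<day>\d{1,2})    # one- or two-digit day
    [ ]                 # literal space (a bare space is ignored in verbose mode)
    (?P<month>\w+)      # month name or abbreviation
    [,.]?               # optional comma or period
    [ ]                 # literal space
    (?P<year>\d{4})     # four-digit year
    """,
    re.VERBOSE,
)

dateStr = "20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009"

# Whole matches, equivalent to what re.findall returns for the plain pattern
full_matches = [m.group(0) for m in DATE_RE.finditer(dateStr)]
# (day, month, year) tuples, with the punctuation left out
parts = [m.group("day", "month", "year") for m in DATE_RE.finditer(dateStr)]

print(full_matches)
print(parts)
```

The named groups are only a convenience if the day, month and year are needed separately; for a plain list of date strings, re.findall with the one-line pattern is enough.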
Answer:
One of the many approaches:
import re
dateStr = "20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009"
regex = r'[0-9]{1,2}\s[a-zA-Z]+[.,]*\s[0-9]{4}'
result = re.findall(regex, dateStr)
print(result)
Output:
['20 Mar 2009', '20 March 2009', '20 Mar. 2009', '20 March, 2009']
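As for why the pattern in the question drops the day, one likely reading is that nothing in it can consume the space between the day and "Mar" while still letting (?:Mar)? match afterwards, so the engine gives up at the day and starts the match at "Mar" instead. The sketch below illustrates this. It assumes the question's pattern ends in `+` quantifiers (a reconstruction), and the `\s*` in `patched` is a hypothetical minimal change, not something proposed in the thread.

```python
import re

dateStr = "20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009"

# Assumed form of the question's pattern; the two `+` quantifiers are a reconstruction.
original = r'(?:\d{1,2}[/-]*)?(?:Mar)?[a-z\s,.]*(?:\d{1,2}[/-]*)+(?:\d{2,4})+'

# Anchored at the start of the string the pattern fails: after "20" comes a
# space, which (?:Mar)? cannot match, and [a-z\s,.]* cannot get past the
# uppercase "M" to reach the next digits.
anchored = re.match(original, dateStr)

# Unanchored, the first match therefore begins at "Mar", without the day.
first_hit = re.search(original, dateStr).group(0)

# Hypothetical minimal fix: allow whitespace between the day and the month.
patched = r'(?:\d{1,2}[/-]*)?\s*(?:Mar)?[a-z\s,.]*(?:\d{1,2}[/-]*)+(?:\d{2,4})+'
patched_result = re.findall(patched, dateStr)

print(anchored)
print(first_hit)
print(patched_result)
```

The patched pattern is still tied to "Mar" and is shown only to pinpoint the problem; the simpler patterns in the answers above are easier to read and maintain.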