I am using a Regex to pull dates out of a series of strings. The format varies slightly, but it always contains the full month. The strings usually contain two dates to represent a range like so:
February 1, 2020 - March 18, 2020
or
February 1st 2020 - March 18th 2020
And this is working great until I come across dates like:
June 1 - July 22, 2018
where a year is not presented in the "starting" part of the range because it is the same as the "ending" year.
Below is the Regex I crudely copied and applied to my code. It is JavaScript, but I really think this is more of a Regex question...
const regex = /((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?((\s*[,.\-\/]\s*)\D?)?\s*((19[0-9]\d|20\d{2})|\d{2})*/gm;
var myDateString1 = "January 8, 2020 - January 27, 2020"; // THIS WORKS GREAT!
var myDateString2 = "January 8 - January 27, 2020"; // THIS DOES NOT WORK GREAT!
var dates = myDateString1.match(regex);
// returns ["January 8, 2020","January 27, 2020"]
var dates2 = myDateString2.match(regex);
// returns ["January 8 - J"]
Is there a way I can modify this so that when it meets a hyphen it ends the current match? Then myDateString2 would return ["January 8", "January 27, 2020"].
The strings sometimes have words before or after, like
Presented from January 8, 2020 - January 27, 2020 at such and such place
so I don't think simply having a regex based on the hyphen before/after would work.
CodePudding user response:
You could use 2 capture groups and make the pattern more specific to match the format of the strings.
The /m flag can be omitted as there are no anchors in the pattern.
Note that the pattern matches date-like text and does not validate the date itself.
\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b
See a regex101 demo.
const regex = /\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b((?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\s*\d\d?,\s+\d{4})\b/g;
const str = `January 8, 2020 - January 27, 2020
January 8 - January 27, 2020
Presented from January 8, 2020 - January 27, 2020 at such and such place
June 1 - July 22, 2018`;
console.log(Array.from(str.matchAll(regex), m => [m[1], m[2]]))
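Since the pattern only recognizes date-like text, here is a rough sketch of what could be done with the two capture groups afterwards. It copies the year of the end date onto a start date that has none, and rejects days that do not exist (such as February 30). The names MONTH, MONTH_INDEX, rangeRegex, toDate and parseRanges are made up for this example, and the extra capture group around the end year is an addition, not part of the pattern above. It also assumes a missing start year always equals the end year, which would be wrong for a range like "December 28 - January 3, 2021".

```javascript
// Month fragment stored once so it is not repeated for both dates (illustrative choice).
const MONTH = "(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)";

// Group 1: start date (year optional). Group 2: end date. Group 3: end year.
const rangeRegex = new RegExp(
  String.raw`\b(${MONTH}\s*\d\d?(?:,\s+\d{4})?)\s+[,./-]\s+\b(${MONTH}\s*\d\d?,\s+(\d{4}))\b`,
  "g"
);

const MONTH_INDEX = { jan: 0, feb: 1, mar: 2, apr: 3, may: 4, jun: 5,
                      jul: 6, aug: 7, sep: 8, oct: 9, nov: 10, dec: 11 };

// Turns "January 8, 2020" (or "January 8" plus a fallback year) into a UTC Date.
// Returns null if the text has the wrong shape or the day does not exist.
function toDate(text, fallbackYear) {
  const m = /^([A-Za-z]+)\s*(\d\d?)(?:,\s+(\d{4}))?$/.exec(text);
  if (!m) return null;
  const month = MONTH_INDEX[m[1].slice(0, 3).toLowerCase()];
  const day = Number(m[2]);
  const year = Number(m[3] ?? fallbackYear);
  const date = new Date(Date.UTC(year, month, day));
  // Date.UTC rolls "February 30" over into March, so compare the parts back.
  return date.getUTCMonth() === month && date.getUTCDate() === day ? date : null;
}

// Finds every range in the string and fills a missing start year from the end year.
function parseRanges(str) {
  return Array.from(str.matchAll(rangeRegex), (m) => ({
    start: toDate(m[1], m[3]),
    end: toDate(m[2], m[3]),
  }));
}

const ranges = parseRanges("Presented from June 1 - July 22, 2018 at such and such place");
console.log(ranges[0].start.toISOString().slice(0, 10)); // 2018-06-01
console.log(ranges[0].end.toISOString().slice(0, 10));   // 2018-07-22
```

Like the pattern above, this sketch expects a comma before the year, so the "February 1st 2020" style from the question would need an extra optional ordinal suffix and an optional comma.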
CodePudding user response:
The capture groups were only used for debugging.
Simply taking the hyphen out of the character class and making the year at the end optional with a single ? (instead of *) should get you what you want.
/((\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?)(\d{1,2}(st|nd|rd|th)?)?((\s*[,.\/]\s*)\D?)?(\s*(19[0-9]\d|20\d{2})|\d{2})?/
https://regex101.com/r/bMMzcR/1
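To see why the hyphen matters, the separator part of the pattern can be tried on its own. The snippet below is only an illustration: the ^ anchor and the variable names are added for the demo, and tail is the text that follows "January 8" in myDateString2. With the hyphen in the class, the separator group consumes " - " and the trailing \D? then takes the "J", which is where the "January 8 - J" result in the question comes from.

```javascript
const tail = " - January 27, 2020"; // what follows "January 8" in myDateString2

// Separator group as in the question: the class contains a hyphen.
const withHyphen = /^(?:\s*[,.\-\/]\s*)\D?/;
// Separator group as in the pattern above: hyphen removed.
const withoutHyphen = /^(?:\s*[,.\/]\s*)\D?/;

console.log(JSON.stringify(withHyphen.exec(tail)[0])); // " - J"
console.log(withoutHyphen.exec(tail));                  // null
```

Once the separator can no longer cross the hyphen, the first match stops at "January 8" and the second date is left intact for the next match.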
And replacing the capture groups with clusters (?: ), then giving it one more level of factoring, will make it quicker.
(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:\s*[,./]\s*\D?)?(?:\s*(?:19[0-9]\d|20\d{2})|\d{2})?
https://regex101.com/r/Fq0dy2/1
const regex = /(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:\s*[,./]\s*\D?)?(?:\s*(?:19[0-9]\d|20\d{2})|\d{2})?/g;
var myDateString1 = "January 8, 2020 - January 27, 2020";
var myDateString2 = "January 8 - January 27, 2020";
console.log(myDateString1.match(regex));
console.log(myDateString2.match(regex));
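Because match returns a flat list of date strings here, the start of the range may still lack a year. As a minimal sketch, assuming the first two matches in a string are the start and end of the range, the missing year could be carried over like this. extractRange, dateRegex and YEAR_AT_END are hypothetical names for this example only.

```javascript
// Same factored pattern as above.
const dateRegex = /(?:\b\d{1,2}\D{0,3})?\b(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?)\D?(?:\d{1,2}(?:st|[nr]d|th)?)?(?:\s*[,./]\s*\D?)?(?:\s*(?:19[0-9]\d|20\d{2})|\d{2})?/g;

// A four-digit year (19xx or 20xx) at the very end of a matched date.
const YEAR_AT_END = /\b(?:19|20)\d{2}$/;

// Returns [start, end] as strings, or null if fewer than two dates are found.
function extractRange(str) {
  const found = (str.match(dateRegex) || []).map((s) => s.trim());
  if (found.length < 2) return null;
  let [start, end] = found;
  const year = YEAR_AT_END.exec(end);
  // Only append the end year when the start date has no year of its own.
  if (year && !YEAR_AT_END.test(start)) start = `${start}, ${year[0]}`;
  return [start, end];
}

console.log(extractRange("January 8 - January 27, 2020"));
// [ 'January 8, 2020', 'January 27, 2020' ]
console.log(extractRange("February 1st 2020 - March 18th 2020"));
// [ 'February 1st 2020', 'March 18th 2020' ]
console.log(extractRange("Presented from June 1 - July 22, 2018 at such and such place"));
// [ 'June 1, 2018', 'July 22, 2018' ]
```

Note that the pattern has no word boundary after the abbreviated month names, so a word such as "Mark" in the surrounding text could produce a false match; the pairing above assumes that does not happen.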