Hi, I'm struggling to understand why my regex isn't working.
I have URLs that contain DOIs, like so:
https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural Resources Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true
And I'm using, for example, this regex, but it always returns an empty list:
print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
Where have I gone wrong?
CodePudding user response:
It looks like you come from another programming language that has the notion of regex literals, which are delimited with forward slashes and have the modifiers following the closing slash (hence /i).
In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For flags like i you can use the optional flags parameter of findall.
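To illustrate the point about delimiters and flags, here is a minimal sketch. The sample text and the variable names are made up for the demonstration; only re.findall and re.I come from the standard library.

```python
import re

# Made-up sample text, just for the demonstration.
text = "doi:10.1000/ABC and doi:10.1000/abc"

# The slashes and the trailing "i" are literal characters in Python,
# so this looks for the exact substring "/abc/i" and finds nothing.
literal_slashes = re.findall(r'/abc/i', text)
print(literal_slashes)  # []

# Case-insensitive matching is requested through the flags parameter.
with_flag = re.findall(r'abc', text, flags=re.I)
print(with_flag)  # ['ABC', 'abc']
```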
Secondly, ^ will match the start of the input string, but evidently the URLs you have as input do not start with 10, so that has to go. Instead you could require that the 10 must follow a word break (\b), i.e. it should not be preceded by an alphanumeric character (or underscore).
Similarly, $ will match the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not go on until the end of the input. So that has to go too.
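A quick way to see the effect of the anchors is to run the same pattern with and without them. This is only a sketch: the character class is simplified to [\w.] to keep the focus on ^, $ and \b, and the variable names are illustrative.

```python
import re

url = 'https://dx.doi.org/10.1108/14664100110397304?nols=y'

# With ^ and $ the whole string would have to be a DOI, so nothing matches.
anchored = re.findall(r'^10\.\d{4,9}/[\w.]+$', url)
print(anchored)  # []

# With \b the match may start anywhere a word boundary precedes "10",
# and it simply stops at the first character outside the class ("?").
unanchored = re.findall(r'\b10\.\d{4,9}/[\w.]+', url)
print(unanchored)  # ['10.1108/14664100110397304']
```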
The dot has a special meaning in regex, but you clearly intended to match a literal dot, so it should be escaped.
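As a small illustration of why the escape matters (the input string here is invented for the purpose), an unescaped dot accepts any character after the 10, while \. only accepts a real dot:

```python
import re

# Invented input: one plain number and one DOI-like prefix.
sample = '1012345 10.1234'

# Unescaped dot: "10" followed by ANY character, then four digits.
unescaped = re.findall(r'10.\d{4}', sample)
print(unescaped)  # ['1012345', '10.1234']

# Escaped dot: "10" followed by a literal ".", then four digits.
escaped = re.findall(r'10\.\d{4}', sample)
print(escaped)  # ['10.1234']
```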
Finally, alphanumeric characters can be matched with \w, which matches both lower case and capital Latin letters as well as digits and the underscore, so you can shorten the character class a bit and do without any flags such as i (re.I).
This leaves us with:
import re
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
                 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
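If you want to run this over a whole list of URLs, something along these lines might work. It is a sketch, not the only way to do it: DOI_PATTERN and extract_doi are names I made up, and the URLs are a few of the ones from the question. Note that the openurl link contains no DOI, so it yields None.

```python
import re

# Hypothetical helper names; the pattern is the one from the answer above.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

def extract_doi(url):
    """Return the first DOI-like substring in url, or None if there is none."""
    match = DOI_PATTERN.search(url)
    return match.group(0) if match else None

urls = [
    'https://link.springer.com/10.1007/s00737-021-01116-5',
    'https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228',
    'https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural Resources Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3',
    'https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true',
]

for url in urls:
    print(extract_doi(url))
```

Using search instead of findall returns only the first hit per URL, which is usually what you want when each link carries at most one DOI.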