Regex Match on String (DOI)

Time:08-03

Hi, I'm struggling to understand why my regex isn't working.

I have URLs that contain DOIs, like so:

https://link.springer.com/10.1007/s00737-021-01116-5
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228
https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435
https://journals.sagepub.com/doi/pdf/10.1177/1062860613484171
https://onlinelibrary.wiley.com/resolve/openurl?genre=article&title=Natural Resources Forum&issn=0165-0203&volume=26&date=2002&issue=1&spage=3
https://dx.doi.org/10.1108/14664100110397304?nols=y
https://onlinelibrary.wiley.com/doi/10.1111/jocn.15833
https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true

For example, I'm using this regex, but it always returns an empty list:

print(re.findall(r'/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i', 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))

Where have I gone wrong?

Answer:

It looks like you come from another programming language that has the notion of regex literals that are delimited with forward slashes and have the modifiers following the closing slash (hence /i).

In Python there is no such thing, and these slashes and modifier(s) are taken as literal characters. For a flag like i, pass re.I (re.IGNORECASE) through the optional flags parameter of re.findall.
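As a small illustration of that point, the sketch below compares a slash-delimited pattern with a bare pattern plus the flags parameter. The sample text and variable names are made up for the demonstration and are not from the question.

```python
import re

# Hypothetical sample text, used only to illustrate the flags parameter.
text = 'See DOI 10.1111/jocn.13435 or doi 10.1111/jocn.15833'

# The slashes and the trailing "i" are matched literally, so this finds nothing.
literal_style = re.findall(r'/doi/i', text)

# Bare pattern plus flags=re.IGNORECASE (alias re.I) matches both spellings.
with_flag = re.findall(r'doi', text, flags=re.IGNORECASE)

print(literal_style)
print(with_flag)
```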

Secondly, ^ will match the start of the input string, but evidently the URLs you have as input do not start with 10, so that has to go. Instead you could require that the 10 follows a word boundary (\b), i.e. it should not be preceded by an alphanumeric character (or underscore).

Similarly, $ will match the end of the input string, but you have URLs that continue with URL parameters, like ?nols=y, so again the part you are interested in does not go on until the end of the input. So that has to go too.
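To see the effect of the two anchors, here is a rough sketch that tries the question's URL with and without them. The dot is already escaped and the character class is written with explicit lower-case letters so that only the anchors differ. The last two lines use a made-up string to show what \b guards against.

```python
import re

url = 'https://dx.doi.org/10.1108/02652320410549638?nols=y'
body = r'10\.\d{4,9}/[-._;()/:A-Za-z0-9]+'

# ^ demands that the string starts with "10", which a URL never does.
with_start = re.findall('^' + body, url)

# $ demands that the DOI runs to the end, but "?nols=y" follows it.
with_end = re.findall(body + '$', url)

# No anchors, just a word boundary in front of the "10".
with_boundary = re.findall(r'\b' + body, url)

# Hypothetical string: "210.1234" should not be read as a DOI prefix.
loose = re.findall(body, 'x/210.1234/abc')
strict = re.findall(r'\b' + body, 'x/210.1234/abc')

print(with_start, with_end, with_boundary, loose, strict)
```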

An unescaped dot has a special meaning in regex: it matches any character (except a newline by default). You clearly intended to match a literal dot, so it should be escaped as \. in the pattern.
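A quick way to convince yourself of the difference, using a made-up input string that contains no dot at all:

```python
import re

# Hypothetical input without any dot in it.
text = 'id=1011108'

# Unescaped dot: "." stands for any character, so "10" + "1" + "1108" matches.
unescaped = re.findall(r'10.1108', text)

# Escaped dot: only a literal "." is accepted, so nothing matches here.
escaped = re.findall(r'10\.1108', text)

print(unescaped)
print(escaped)
```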

Finally, alphanumeric characters can be matched with \w, which covers digits, the underscore, and both lower-case and capital Latin letters, so you can shorten the character class a bit and do without any flags such as i (re.I).
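The following sketch compares the two character classes on one of the question's URLs whose DOI suffix contains lower-case letters. No flags are passed in either call, so the only difference is A-Z0-9_ versus \w.

```python
import re

url = 'https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435'

# Upper-case-only class without re.I: the lower-case "j" stops the match.
upper_only = re.findall(r'\b10\.\d{4,9}/[-._;()/:A-Z0-9]+', url)

# \w already covers both cases, digits and the underscore.
with_word_class = re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+', url)

print(upper_only)
print(with_word_class)
```

Note that for str patterns in Python 3, \w also accepts non-ASCII letters unless re.ASCII is passed, which may or may not matter for your data.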

This leaves us with:

import re
print(re.findall(r'\b10\.\d{4,9}/[-.;()/:\w]+',
                 'https://dx.doi.org/10.1108/02652320410549638?nols=y'))
# prints ['10.1108/02652320410549638']
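If you want to run this over the whole list from the question, something along these lines should work. The helper name extract_doi is just an illustrative choice, and the sketch assumes at most one DOI per URL; the openurl link contains no DOI, so it yields None.

```python
import re

# Same pattern as above, compiled once for reuse.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[-.;()/:\w]+')

def extract_doi(url):
    """Return the first DOI-like substring in url, or None if there is none."""
    match = DOI_PATTERN.search(url)
    return match.group(0) if match else None

urls = [
    'https://link.springer.com/10.1007/s00737-021-01116-5',
    'https://journals.sagepub.com/doi/pdf/10.1177/1078390319877228',
    'https://onlinelibrary.wiley.com/doi/10.1111/jocn.13435',
    'https://onlinelibrary.wiley.com/resolve/openurl?genre=article'
    '&title=Natural Resources Forum&issn=0165-0203&volume=26'
    '&date=2002&issue=1&spage=3',
    'https://dx.doi.org/10.1108/14664100110397304?nols=y',
    'https://www.tandfonline.com/doi/pdf/10.1080/03768350802090592?needAccess=true',
]

dois = [extract_doi(u) for u in urls]
for url, doi in zip(urls, dois):
    print(doi, '<-', url)
```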