Matching arxiv regular expression in Python

Time:11-17

I have a paragraph, and I want to extract the arXiv DOIs (links) from it. For example, this is the given paragraph:

"Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.

Further information can be referred to this [arXiv article]`(https://arxiv.org/abs/2109.05857).`"

The output should be: https://arxiv.org/abs/2109.05857. The arXiv DOI will always start with "https://arxiv.org" or "arxiv.org", and the appended string could be anything.

I tried exp = re.findall("^https://arxiv.org*", str) but it doesn't work.

Any help would be appreciated.

CodePudding user response:

First, the "^" is wrong, as it only matches at the beginning of the string (or of each line when re.MULTILINE is set). The * after the g does not mean "anything" either; it means "any number of g", including none.
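To make the two problems concrete, here is a small sketch that runs the asker's pattern with and without the anchor. The sample sentence and the variable names (text, anchored, unanchored) are made up for illustration.

```python
import re

# Hypothetical sample text; the link sits in the middle of the string.
text = "See this [arXiv article](https://arxiv.org/abs/2109.05857)."

# "^" anchors the match at the start of the string, so nothing is found.
anchored = re.findall(r"^https://arxiv.org*", text)

# Without "^" the pattern matches, but "g*" only repeats the letter g,
# so the match stops right after "arxiv.org".
unanchored = re.findall(r"https://arxiv.org*", text)

print(anchored)
print(unanchored)
```

The first call comes back empty, and the second returns only the host part of the link rather than the full URL.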

You could try with re_res = re.findall(r"https://arxiv\.org/[^\s\)\]]*[0-9]", text), where text is your paragraph.

It is not foolproof, but it should cover most cases.

The pattern finds https://arxiv.org/ followed by any number (*) of characters that are not ([^...]) whitespace (\s) or a closing round or square bracket (\) and \]), which together give [^\s\)\]]*, and it requires the match to end with a digit ([0-9]).
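As a rough sketch of how this could also cover the bare "arxiv.org" form mentioned in the question, the scheme can be made optional with a non-capturing group. The optional "(?:https://)?" prefix, the function name find_arxiv_links, and the second link in the sample text are assumptions added for illustration, not part of the answer above.

```python
import re

# Optional "https://" prefix, then arxiv.org/, then any run of characters
# that are not whitespace or a closing bracket, ending in a digit.
ARXIV_RE = re.compile(r"(?:https://)?arxiv\.org/[^\s)\]]*[0-9]")


def find_arxiv_links(text):
    """Return every arXiv link found in text, in order of appearance."""
    return ARXIV_RE.findall(text)


# Hypothetical sample text with one full URL and one bare link.
sample = (
    "Further information can be referred to this "
    "[arXiv article](https://arxiv.org/abs/2109.05857). "
    "See also arxiv.org/abs/1706.03762 for background."
)

links = find_arxiv_links(sample)
print(links)
```

Because the group is non-capturing, findall returns the whole match rather than just the group. Requiring a trailing digit keeps the closing parenthesis and the sentence-ending period out of the result, but it would also cut off a link that ends in a letter, so treat this as a starting point rather than a complete solution.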

CodePudding user response:

import re

string = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
Further information can be referred to this [arXiv article]`(https://arxiv.org/abs/2109.05857).`"""

# Capture everything between "(" and the next ")"; the "+" quantifiers are required
pattern = re.compile(r'\((https://arxiv\.org[^\)]+|arxiv\.org[^\)]+)', re.MULTILINE)
pattern.findall(string)

#output
#['https://arxiv.org/abs/2109.05857']