I have a paragraph and I want to extract the arXiv DOIs from it. For example, this is the given paragraph:
"Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
Further information can be referred to this [arXiv article]`(https://arxiv.org/abs/2109.05857).`"
The output should be: https://arxiv.org/abs/2109.05857
The arXiv DOI will always start with "https://arxiv.org" or "arxiv.org", and the appended string could be anything.
I tried exp = re.findall("^https://arxiv.org*", str), but it doesn't work.
Any help will be appreciated.
CodePudding user response:
First, the "^" is wrong, as it only matches text at the beginning of the string (or of each line when re.MULTILINE is set). The * after the g also does not mean "anything"; it means "any number of g".
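To make both points concrete, here is a small sketch run against a shortened version of the paragraph from the question (the variable names are only for illustration):

```python
import re

text = "Further information: [arXiv article](https://arxiv.org/abs/2109.05857)."

# "^" anchors the match to the start of the string, so nothing is found.
anchored = re.findall(r"^https://arxiv.org*", text)

# Without "^" the pattern matches, but "g*" only means "zero or more g",
# so the match stops right after "arxiv.org".
unanchored = re.findall(r"https://arxiv.org*", text)

print(anchored)    # []
print(unanchored)  # ['https://arxiv.org']
```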
You could try re_res = re.findall(r"https://arxiv\.org/[^\s\)\]]*[0-9]", text), where text is your paragraph.
It is not foolproof but should cover most cases.
The pattern finds https://arxiv.org/ followed by any number (*) of characters that are not ([^...]) whitespace (\s) or a closing round or square bracket (\)\]), which together gives [^\s\)\]]*, and it requires the match to end with a digit ([0-9]).
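As a minimal sketch, the same idea can be extended to also accept the bare "arxiv.org" form mentioned in the question. The optional "https://" group, the function name find_arxiv_links, and the sample text are assumptions made for illustration, not part of the pattern above:

```python
import re

# Optional "https://" prefix, then "arxiv.org/", then any run of characters
# that are not whitespace or closing brackets, ending in a digit.
ARXIV_RE = re.compile(r"(?:https://)?arxiv\.org/[^\s)\]]*[0-9]")


def find_arxiv_links(text):
    """Return every arXiv link found in text, in order of appearance."""
    return ARXIV_RE.findall(text)


sample = (
    "See [arXiv article](https://arxiv.org/abs/2109.05857). "
    "Also arxiv.org/abs/1706.03762v5, etc."
)
print(find_arxiv_links(sample))
# ['https://arxiv.org/abs/2109.05857', 'arxiv.org/abs/1706.03762v5']
```

Because the match must end in a digit, trailing punctuation such as a comma or full stop is dropped, but a link that ends in a letter would be cut short.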
CodePudding user response:
import re
string = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged.
Further information can be referred to this [arXiv article]`(https://arxiv.org/abs/2109.05857).`"""
# Capture a link inside parentheses that starts with "https://arxiv.org" or "arxiv.org"
pattern = re.compile(r'\((https://arxiv\.org[^)]+|arxiv\.org[^)]+)')
print(pattern.findall(string))
# output
# ['https://arxiv.org/abs/2109.05857']