I'm extracting papers from arXiv by their arXiv IDs, and I use a regex to find those IDs in a .bib file. My current function is the following:
import re

def get_arxiv_ids(bib_file_path):
    list_of_ids = []
    with open(bib_file_path, "r") as f:
        bib_string = f.read()
    arxiv_digits_list = re.findall(r"arXiv:\d{4}\.\d{4,5}", bib_string)  # <----
    for arxiv_id in arxiv_digits_list:
        list_of_ids.append(arxiv_id[6:])  # strip the leading "arXiv:"
    abs_digits_list = re.findall(r"abs/\d{4}\.\d{4,5}", bib_string)  # <----
    for abs_id in abs_digits_list:
        list_of_ids.append(abs_id[4:])  # strip the leading "abs/"
    print("Found {} arxiv ids in {}".format(len(list_of_ids), bib_file_path))
    return list_of_ids
I need to include the arXiv: or abs/ prefix in the pattern, otherwise I will extract some false positives. However, I was wondering whether there is a neater way to remove those characters from each match than simply looping over each element. It's not a performance issue; I'm just curious.
CodePudding user response:
You can use a regular expression capture group to capture only the desired part of the pattern. Groups are created using parentheses:
arxiv_digits_list = re.findall(r"arXiv:(\d{4}\.\d{4,5})", bib_string)
abs_digits_list = re.findall(r"abs/(\d{4}\.\d{4,5})", bib_string)
When used with a single capture group, re.findall() "return[s] a list of strings matching that group."
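To make the difference concrete, here is a small sketch comparing the two forms of the pattern. The sample string is made up for illustration and is not taken from the asker's .bib file.

```python
import re

# Hypothetical sample text; a real .bib file would be read from disk.
sample = "eprint = {arXiv:1706.03762}, url = {https://arxiv.org/abs/1810.04805}"

# Without a group, findall returns the whole match, prefix included.
with_prefix = re.findall(r"arXiv:\d{4}\.\d{4,5}", sample)

# With exactly one capture group, findall returns only the group's text.
ids_only = re.findall(r"arXiv:(\d{4}\.\d{4,5})", sample)

print(with_prefix)  # ['arXiv:1706.03762']
print(ids_only)     # ['1706.03762']
```

The prefix still has to match, so it keeps filtering out false positives, but it no longer appears in the results and no slicing is needed.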
BONUS (EDIT)
You can also combine your two regexes into one:
re.findall(r"(?:arXiv:|abs/)(\d{4}\.\d{4,5})", bib_string)
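This adds a non-capturing group, (?:...), that matches either "arXiv:" or "abs/". Because it does not capture, re.findall() still returns only the ID from the single capture group.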
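Put together, the original function could shrink to something like the sketch below. The names ARXIV_ID_RE and extract_arxiv_ids are hypothetical helpers introduced here for illustration; only get_arxiv_ids comes from the question.

```python
import re

# Matches an ID preceded by "arXiv:" or "abs/"; only the ID itself is captured.
# ARXIV_ID_RE is a hypothetical name, not part of the original code.
ARXIV_ID_RE = re.compile(r"(?:arXiv:|abs/)(\d{4}\.\d{4,5})")


def extract_arxiv_ids(bib_string):
    """Return all arXiv IDs found in the text, in order of appearance."""
    return ARXIV_ID_RE.findall(bib_string)


def get_arxiv_ids(bib_file_path):
    """Read a .bib file and return the arXiv IDs it contains."""
    with open(bib_file_path, "r") as f:
        list_of_ids = extract_arxiv_ids(f.read())
    print("Found {} arxiv ids in {}".format(len(list_of_ids), bib_file_path))
    return list_of_ids
```

One behavioural difference to be aware of: the original function listed all "arXiv:" matches first and then all "abs/" matches, whereas a single pattern returns the IDs in the order they appear in the file. If the same paper is referenced both ways, it shows up twice in either version.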