Is there a neat way to use re.findall() while removing the first 4 characters of all matches? (pytho

I'm extracting papers from arXiv by their arXiv IDs, and I'm using regex to find those IDs. My current function is the following:

import re

def get_arxiv_ids(bib_file_path):
    list_of_ids = []
    with open(bib_file_path, "r") as f:
        bib_string = f.read()

    arxiv_digits_list = re.findall(r"arXiv:\d{4}\.\d{4,5}", bib_string)  # <----
    for arxiv_id in arxiv_digits_list:
        list_of_ids.append(arxiv_id[6:])  # drop the 6-character "arXiv:" prefix

    abs_digits_list = re.findall(r"abs/\d{4}\.\d{4,5}", bib_string)  # <----
    for abs_id in abs_digits_list:
        list_of_ids.append(abs_id[4:])  # drop the 4-character "abs/" prefix

    print("Found {} arxiv ids in {}".format(len(list_of_ids), bib_file_path))
    return list_of_ids

I need to include the arXiv: or abs/ prefix in the pattern; otherwise I will extract some false positives. However, I was wondering if there is a neater way to remove those characters from each match than simply looping over each element. Performance is not an issue; I was just curious.

CodePudding user response:

You can use a regular expression capture group to capture only the desired part of the pattern. Groups are created using parentheses:

arxiv_digits_list = re.findall(r"arXiv:(\d{4}\.\d{4,5})", bib_string)
abs_digits_list = re.findall(r"abs/(\d{4}\.\d{4,5})", bib_string)

When used with a single capture group, re.findall() "return[s] a list of strings matching that group."
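
To see how the number of groups changes what re.findall() returns, here is a minimal sketch. The sample string and the variable names are made up for illustration; only the patterns come from the question.

```python
import re

# Hypothetical sample text, made up for illustration.
sample = "see arXiv:1706.03762 and https://arxiv.org/abs/2005.14165"

# No group: each result is the whole match, prefix included.
whole = re.findall(r"arXiv:\d{4}\.\d{4,5}", sample)

# One capture group: each result is only the text inside the parentheses.
ids_only = re.findall(r"arXiv:(\d{4}\.\d{4,5})", sample)

# Two capture groups: each result becomes a tuple, one item per group.
pairs = re.findall(r"(arXiv:|abs/)(\d{4}\.\d{4,5})", sample)

print(whole)     # ['arXiv:1706.03762']
print(ids_only)  # ['1706.03762']
print(pairs)     # [('arXiv:', '1706.03762'), ('abs/', '2005.14165')]
```

The last case is why a second pair of plain parentheses around the prefix would not give a flat list of IDs.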

BONUS (EDIT)

You can also combine your two regexes into one:

re.findall(r"(?:arXiv:|abs/)(\d{4}\.\d{4,5})", bib_string)

This adds a non-capturing group, (?:...), that matches "arXiv:" OR "abs/" without adding it to the results.
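
Putting the combined pattern back into the original function could look roughly like the sketch below. The names ARXIV_ID_RE and extract_arxiv_ids are hypothetical, introduced here only to separate the regex step from the file reading; the rest follows the function from the question.

```python
import re

# Compiled once; (?:...) groups the two prefixes without capturing them,
# so findall() returns only the ID from the single capture group.
ARXIV_ID_RE = re.compile(r"(?:arXiv:|abs/)(\d{4}\.\d{4,5})")


def extract_arxiv_ids(bib_string):
    """Return the arXiv IDs found in a string (hypothetical helper)."""
    return ARXIV_ID_RE.findall(bib_string)


def get_arxiv_ids(bib_file_path):
    # Same interface as the function in the question, minus the loops.
    with open(bib_file_path, "r") as f:
        list_of_ids = extract_arxiv_ids(f.read())
    print("Found {} arxiv ids in {}".format(len(list_of_ids), bib_file_path))
    return list_of_ids
```

One difference from the two-pass version: the IDs come back in the order they appear in the file, rather than all "arXiv:" matches first followed by all "abs/" matches.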