Extracting RegEx pattern across list excluding other html code

I've written a script to pull the list of available report URL extensions from a page, so the reports can be used for text extraction.

I've used BeautifulSoup to parse the page and extract the reference area for the latest reports using this method:

home = BeautifulSoup(home_url, 'html.parser')
container = home.find('div', attrs={'class': 'list'})
report_url_locations = list(x for x in container.findAll('a'))

This generates a list with each report and its unique HTML extension, which is updated each time a new report is uploaded. For example:

[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]

I've managed to write some code to strip out html junk and keep just the extension for the first element (i.e. first report).

latest_sitrep_location = str(report_url_locations[0])
latest_sitrep_htm_location = re.search(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", latest_sitrep_location)

This gives me:

"2022-05/13/c_76843.htm"

But when I try to do this for every element of the list, it returns all the junk in between as well:

all_urls = re.findall(r"[0-9]+-[0-9]+/[0-9]+/+c_[0-9]+.+htm", str(report_url_locations))
all_urls

['2022-05/13/c_76843.htm">May 16: Daily Report</a>, <a href="2022-05/12/c_76842.htm">May 15: Daily Report</a>, <a href="2022-05/11/c_76841.htm">May 14: Daily Report</a>, <a href="2022-05/10/c_76839.htm">May 13: Daily Report</a>']

But what I want is:

["2022-05/13/c_76843.htm","2022-05/12/c_76842.htm","2022-05/11/c_76841.htm","2022-05/10/c_76839.htm"]

Can somebody tell me what I need to include in my RegEx to ensure the other HTML is excluded? I'm fairly sure I need to convert every element in report_url_locations to a string, but I don't know how to do this en masse.

CodePudding user response：

Why don't you just try this:

report_url_locations = [x["href"] for x in container.findAll('a')]

And then just print the report_url_locations

By the way, you shouldn't be using regex to parse HTML.
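
To make the suggestion above concrete, here is a minimal sketch that runs on the sample links from the question. The wrapping `<div class="list">` and the variable names `html` and `report_urls` are assumptions made for illustration; the real page is not shown in the question. It uses `a.get("href")` rather than `a["href"]` so that an anchor without an `href` attribute is skipped instead of raising a `KeyError`.

```python
from bs4 import BeautifulSoup

# Sample links copied from the question; the surrounding <div> is assumed.
html = """<div class="list">
<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>
<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>
<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>
<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
</div>"""

home = BeautifulSoup(html, "html.parser")
container = home.find("div", attrs={"class": "list"})

# Read the href attribute directly from each tag, so no string conversion
# or regex is needed. Anchors without an href are filtered out.
report_urls = [a.get("href") for a in container.find_all("a") if a.get("href")]

print(report_urls)
```

The first element, `report_urls[0]`, is then the extension of the latest report.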

CodePudding user response：

Edit: Don't use regex for HTML parsing, you know the drill.

If you're set on using regex though, you could use r'(?:href=)\"(.*?)\"'. The non-greedy (.*?) stops at the first closing quote, so each match contains only one href value.


import re

text = """<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
 <a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
 <a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
 <a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
"""

re.findall(r'(?:href=)\"(.*?)\"', text)

Which outputs:

['2022-05/13/c_76843.htm',
 '2022-05/12/c_76842.htm',
 '2022-05/11/c_76841.htm',
 '2022-05/10/c_76839.htm']
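
As for why the pattern in the question swallowed everything between the first and last link: a likely cause is a greedy wildcard such as `.+` before `htm`, which runs on to the last `htm` in the string. On a single element that goes unnoticed, because there is only one `htm` to find. The sketch below is an illustration under that assumption. The `text` value mimics `str(report_url_locations)`, and the names `greedy` and `strict` are hypothetical.

```python
import re

# Assumed shape of str(report_url_locations), built from the sample links.
text = (
    '[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>, '
    '<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>, '
    '<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>, '
    '<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]'
)

# Greedy: ".+" takes as much as it can, so a single match stretches from
# the first date to the last "htm" in the whole string.
greedy = re.findall(r"[0-9]+-[0-9]+/[0-9]+/c_[0-9]+.+htm", text)

# Specific: no wildcard, and the dot before "htm" is escaped so it only
# matches a literal ".". Each match now stops at the end of one extension.
strict = re.findall(r"[0-9]+-[0-9]+/[0-9]+/c_[0-9]+\.htm", text)

print(len(greedy))
print(strict)
```

With the stricter pattern there is no need to convert each element to a string separately; a single `str()` of the whole list is enough for `re.findall` to pick out every extension.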