I've written a script to pull a list of the report URL extensions available on a page, for use in text extraction.
I used BeautifulSoup to parse the page and extract the area that holds the report links:
home = BeautifulSoup(home_url, 'html.parser')
container = home.find('div', attrs={'class': 'list'})
report_url_locations = list(x for x in container.findAll('a'))
This generates a list with each report and its unique HTML extension, which is updated each time a new report is uploaded. For example:
[<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>]
I've managed to write some code to strip out the HTML junk and keep just the extension for the first element (i.e. the first report):
latest_sitrep_location = str(report_url_locations[0])
latest_sitrep_htm_location = re.search(r"[0-9]*-[0-9]*/[0-9]*/*c_[0-9]*.*htm", latest_sitrep_location)
This gives me:
"2022-05/13/c_76843.htm"
But when I try to do this for every element of the list, it returns all the junk in between:
all_urls = re.findall(r"[0-9]*-[0-9]*/[0-9]*/*c_[0-9]*.*htm", str(report_url_locations))
all_urls
['2022-05/13/c_76843.htm">May 16: Daily Report</a>, <a href="2022-05/12/c_76842.htm">May 15: Daily Report</a>, <a href="2022-05/11/c_76841.htm">May 14: Daily Report</a>, <a href="2022-05/10/c_76839.htm">May 13: Daily Report</a>]
But what I want is:
["2022-05/13/c_76843.htm","2022-05/12/c_76842.htm","2022-05/11/c_76841.htm","2022-05/10/c_76839.htm"]
Can somebody tell me what I need to include in my regex to ensure the other HTML is excluded? I'm fairly sure I need to convert every element in report_url_locations to a string, but I don't know how to do this en masse.
CodePudding user response:
Why don't you just try this:
report_url_locations = [x["href"] for x in container.findAll('a')]
Then just print report_url_locations.
By the way, you shouldn't be using regex to parse HTML.
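For anyone who wants to try the attribute-lookup idea without installing anything, here is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup. That is a swapped-in parser, not what the answer above uses. The class name HrefCollector and the sample markup are made up for illustration; the hrefs are copied from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Sample markup modelled on the list in the question (an assumption,
# not the real page)
sample_html = """
<div class="list">
<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>
<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>
<a>no href here</a>
</div>
"""

collector = HrefCollector()
collector.feed(sample_html)
report_urls = collector.hrefs
print(report_urls)
```

The anchor without an href is skipped instead of raising an error. With BeautifulSoup, x.get("href") behaves similarly in that it returns None for a missing attribute, whereas x["href"] raises a KeyError.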
CodePudding user response:
Edit: Don't use regex for HTML parsing, you know the drill.
If you're set on using regex though, you could use r'(?:href=)\"(.*?)\"'.
import re

text="""<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>,
<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>,
<a href="2022-05/11/c_76841.htm">May 14: Daily report</a>,
<a href="2022-05/10/c_76839.htm">May 13: Daily report</a>
"""
re.findall(r'(?:href=)\"(.*?)\"', text)
Which outputs:
['2022-05/13/c_76843.htm',
'2022-05/12/c_76842.htm',
'2022-05/11/c_76841.htm',
'2022-05/10/c_76839.htm']
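The lazy .*? is what keeps each match short. A greedy .* runs on to the last closing quote it can find, which is the kind of over-matching shown in the question. Here is a small sketch comparing the two, and also applying a pattern to each element's string separately. The names anchors, joined and HREF_RE are placeholders, and the two anchor strings are copied from the question.

```python
import re

# Two anchors from the question, as plain strings
anchors = [
    '<a href="2022-05/13/c_76843.htm">May 16: Daily report</a>',
    '<a href="2022-05/12/c_76842.htm">May 15: Daily report</a>',
]
joined = ", ".join(anchors)

# Greedy: .* extends to the LAST double quote in the whole string,
# so everything between the first and last href is swallowed
greedy = re.findall(r'href="(.*)"', joined)

# Lazy: .*? stops at the FIRST double quote after each href="
lazy = re.findall(r'href="(.*?)"', joined)

# Alternative: a negated character class cannot cross a quote at all
HREF_RE = re.compile(r'href="([^"]*)"')

# Applying the pattern to every element en masse with a comprehension;
# elements with no match are skipped
per_element = [m.group(1) for m in (HREF_RE.search(a) for a in anchors) if m]

print(len(greedy), lazy, per_element)
```

Running the search per element also answers the "convert every element" part of the question: str(x) on each anchor inside the comprehension would do the same job on the BeautifulSoup objects.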