In the code below I can either pass a function to soup.find_all to search the tag text with a regular expression, or use the href keyword to search inside the link target.
from bs4 import BeautifulSoup
import re
# s is an example string; in reality it comes from scraping a web page.
s = """<tr>
<td><a href="/-/media/markets-ops/rpm/rpm-auction-info/2023-2024/2023-2024-base-residual-auction-report.ashx" target="_blank">Report <i >PDF</i></a> | <a href="/-/media/markets-ops/rpm/rpm-auction-info/2023-2024/2023-2024-base-residual-auction-results.ashx" target="_blank">Results <i >XLS</i></a> </td>
<td style="text-align: right;">6.21.2022</td>
</tr>
<tr>
<td><strong>3rd Incremental Auction</strong><br>
<a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-incremental-auction-report.ashx" target="_blank">Report <i >PDF</i></a> | <a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-incremental-auction-results.ashx" target="_blank">Results <i >XLS</i></a><br>
<a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-incremental-auction-pre-auction-credit-calculator.ashx" target="_blank">Capacity Performance Pre-Auction Credit Calculator <i >XLS</i></a><br>
<a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-ia-planning-parameters.ashx" target="_blank">Planning Parameters <i >XLS</i></a></td>
<td style="text-align: right;"><br>
3.5.2021<br>
2.1.2021<br>
<br>
3.9.2021</td>
</tr>"""
soup = BeautifulSoup(s, 'html.parser')
# 1) href search
gr = 'base-residual'
soup.find_all(href = re.compile(gr, re.IGNORECASE | re.DOTALL))
# 2) function search, this doesn't look inside the href
def find_auction_results(tag):
    return tag.name == "tr" and bool(re.search("3rd Incremental Auction", tag.text, re.IGNORECASE | re.DOTALL))
soup.find_all(find_auction_results)
How can I do both in the same call, or in two different calls but with a single joined output? With the latter I can simply use list.extend(), but what if there is an overlap? How would I, in that case, return non-overlapping results to process further? Can I search for a regular expression in both the text and the href simultaneously?
Expected: a single list of unique, non-overlapping results. Actual: two separate calls.
CodePudding user response:
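In this particular case, you can use CSS selectors through soup.select. This relies on the soupsieve package that is installed as a bs4 dependency rather than on a particular parser such as html5lib: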
auction_results = soup.select('tr:-soup-contains("3rd Incremental Auction"), tr:has(*[href*="base-residual"])')
or, if you want to parametrize it a bit
inText, inAttr = "3rd Incremental Auction", "base-residual"
tag, attr = "tr", "href"
selector = f'{tag}:-soup-contains("{inText}"), {tag}:has(*[{attr}*="{inAttr}"])'
auction_results = soup.select(selector)
It's pretty much the same as
auction_results = soup.find_all(lambda t: t.name == tag and ((
    t.find(lambda c: c.get(attr) and inAttr in c.get(attr)) is not None
) or (t.text and inText in t.text)))
Both forms are case-sensitive, but with find_all you can use re.search(inText, t.text, re.IGNORECASE | re.DOTALL) instead of inText in t.text (and similarly replace inAttr in c.get(attr) with a re.search call).
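To make the case-insensitive variant concrete, here is a minimal sketch. The HTML snippet, the ids, and the names text_pat, attr_pat and row_matches are made up for illustration; the only bs4 features used are find_all with a function and find with a compiled regex as the href filter, as in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: note the different letter case in the href and the text.
html = """<table>
<tr id="a"><td><a href="/x/2023-2024-BASE-RESIDUAL-report.ashx">Report</a></td></tr>
<tr id="b"><td><strong>3rd incremental auction</strong></td></tr>
<tr id="c"><td>Other</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

text_pat = re.compile(r"3rd Incremental Auction", re.IGNORECASE)
attr_pat = re.compile(r"base-residual", re.IGNORECASE)

def row_matches(t):
    """True for a tr whose text matches text_pat or that contains a matching href."""
    if t.name != "tr":
        return False
    in_text = bool(text_pat.search(t.get_text()))
    in_attr = t.find(href=attr_pat) is not None  # regex is applied to the href value
    return in_text or in_attr

matched_ids = [t["id"] for t in soup.find_all(row_matches)]

# The plain substring test is case-sensitive, so it finds nothing in this sample.
plain_ids = [
    t["id"]
    for t in soup.find_all(
        lambda t: t.name == "tr" and "3rd Incremental Auction" in t.get_text()
    )
]

print(matched_ids, plain_ids)
```

Because each tr is tested once by a single function, a row that satisfies both conditions still appears only once in the result.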
It's not a problem in this particular case, but if tag is something commonly nested like div (or sometimes tr, in nested tables), parent tags with the same name will also be included in auction_results; you can filter those out with
auction_results = [
    ar for ar in auction_results if not
    [c for c in auction_results if ar in c.parents]
]
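A small sketch of that filter on a nested table, to show which rows survive. The HTML and the helper name has_matched_descendant are assumptions for illustration. The helper compares with `is` rather than `in`; as far as I know bs4 compares tags by content, so identity is the safer test when two rows could look the same.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: "outer" only matches because of the text in "inner".
html = """<table>
<tr id="outer"><td>
  <table>
    <tr id="inner"><td><strong>3rd Incremental Auction</strong></td></tr>
  </table>
</td></tr>
<tr id="flat"><td><a href="/base-residual-report.ashx">Report</a></td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

matches = soup.select(
    'tr:-soup-contains("3rd Incremental Auction"), tr:has(*[href*="base-residual"])'
)
all_ids = [m["id"] for m in matches]

def has_matched_descendant(tag, results):
    """True if tag is an ancestor of some other tag in results."""
    return any(tag is parent for other in results for parent in other.parents)

# Keep only the innermost matches.
innermost = [m for m in matches if not has_matched_descendant(m, matches)]
innermost_ids = [m["id"] for m in innermost]

print(all_ids, innermost_ids)
```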
Or, you can put it all in a function so that parent tags are also filtered out in a single call, with something like soup.find_all(lambda t: find_results(t, inText, inAttr))
def checkPat(pat, main): # for convenience
    return True if re.search(pat, main, re.IGNORECASE | re.DOTALL) else False

def find_results(tag, textPat, attrPat, tagName='tr', attrName='href'):
    if tag.name != tagName: return False
    hasText = checkPat(textPat, tag.text) if tag.text else False
    hasAttr = tag.find(lambda t: t.get(attrName) and checkPat(
        attrPat, t.get(attrName))) is not None
    if hasText or hasAttr:
        # only accept this tag if no nested tag of the same name also matches
        if not tag.find(lambda t: find_results(t,textPat,attrPat,tagName,attrName)):
            return True
    return False
Or, if you'll always be using the same parameters and want to be able to write just soup.find_all(find_auction_results) without a lambda:
# def checkPat... # as before
def find_auction_results(tag):
    textPat = "3rd Incremental Auction"
    attrPat = "base-residual"
    tagName, attrName = "tr", "href"
    if tag.name != tagName: return False # has to be tr
    hasText = checkPat(textPat, tag.text) if tag.text else False
    hasAttr = tag.find(lambda t: t.get(attrName) and checkPat(
        attrPat, t.get(attrName))) is not None
    if hasText or hasAttr:
        # if nested like [ tr -> tr -> a ] , will only allow inner tr
        if not tag.find(find_auction_results): return True
    return False
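If you would rather keep two separate calls, as the question also asks, the results can be merged without duplicates. This is only a sketch under a few assumptions: the HTML is a shortened stand-in, the href matches are mapped up to their enclosing tr with find_parent so both lists hold rows, and duplicates are dropped by object identity (id) rather than by tag equality.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: r3 matches on both the text and the href.
html = """<table>
<tr id="r1"><td><a href="/2023-2024-base-residual-auction-report.ashx">Report</a></td></tr>
<tr id="r2"><td><strong>3rd Incremental Auction</strong></td></tr>
<tr id="r3"><td><strong>3rd Incremental Auction</strong>
<a href="/2020-2021-base-residual-auction-report.ashx">Report</a></td></tr>
<tr id="r4"><td>Unrelated row</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# Call 1: rows whose text matches.
by_text = soup.find_all(
    lambda t: t.name == "tr"
    and bool(re.search("3rd Incremental Auction", t.get_text(), re.IGNORECASE))
)

# Call 2: tags whose href matches, mapped to the row that contains them.
by_href = [
    a.find_parent("tr")
    for a in soup.find_all(href=re.compile("base-residual", re.IGNORECASE))
]

# Merge, keeping the first occurrence of each row object.
merged, seen = [], set()
for row in list(by_text) + by_href:
    if row is not None and id(row) not in seen:
        seen.add(id(row))
        merged.append(row)

merged_ids = [row["id"] for row in merged]
print(merged_ids)
```

The merged list is in first-seen order (text matches first, then href matches), not document order; the single-call versions above return rows in document order.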