Scrape webpage element with text or href criteria simultaneously

With the code below I can either search inside the href attribute by passing the href keyword to soup.find_all, or pass a function to soup.find_all that matches a regular expression against the tag text.

from bs4 import BeautifulSoup
import re


# s is an example string. Scraping a webpage in reality.
s = """<tr>
            <td><a href="/-/media/markets-ops/rpm/rpm-auction-info/2023-2024/2023-2024-base-residual-auction-report.ashx" target="_blank">Report&nbsp;<i >PDF</i></a> | <a href="/-/media/markets-ops/rpm/rpm-auction-info/2023-2024/2023-2024-base-residual-auction-results.ashx" target="_blank">Results&nbsp;<i >XLS</i></a>&nbsp;</td>
            <td style="text-align: right;">6.21.2022</td>
        </tr>
        
    <tr>
            <td><strong>3rd Incremental Auction</strong><br>
            <a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-incremental-auction-report.ashx" target="_blank">Report&nbsp;<i >PDF</i></a>&nbsp;| <a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-incremental-auction-results.ashx" target="_blank">Results&nbsp;<i >XLS</i></a><br>
            <a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-incremental-auction-pre-auction-credit-calculator.ashx" target="_blank">Capacity Performance Pre-Auction Credit Calculator&nbsp;<i >XLS</i></a><br>
            <a href="/-/media/markets-ops/rpm/rpm-auction-info/2021-2022/2021-2022-third-ia-planning-parameters.ashx" target="_blank">Planning Parameters&nbsp;<i >XLS</i></a></td>
            <td style="text-align: right;"><br>
            3.5.2021<br>
            2.1.2021<br>
            <br>
            3.9.2021</td>
        </tr>"""


soup = BeautifulSoup(s, 'html.parser')

# 1) href search
gr = 'base-residual'
soup.find_all(href = re.compile(gr, re.IGNORECASE | re.DOTALL))

# 2) function search, this doesn't look inside the href  

def find_auction_results(tag):

    return tag.name == "tr" and bool(re.search("3rd Incremental Auction", tag.text, re.IGNORECASE | re.DOTALL))

soup.find_all(find_auction_results)

How can I do both in the same call, or in two separate calls whose results are joined into a single output? With the latter I could simply use list.extend(), but what if the results overlap? How would I then get a list without duplicates to process further? Can I search for a regular expression in both the text and the href at once?

Expected: a single list of unique, non-overlapping soup elements. Actual: two separate calls with separate results.

Answer:

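In this particular case you can use CSS selectors. select() works with any parser, including the html.parser used in the question; it is backed by the soupsieve package, and the :-soup-contains pseudo-class needs soupsieve 2.1 or later.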

auction_results = soup.select('tr:-soup-contains("3rd Incremental Auction"), tr:has(*[href*="base-residual"])')

or, if you want to parametrize it a bit

inText, inAttr = "3rd Incremental Auction", "base-residual"
tag, attr = "tr", "href"
selector = f'{tag}:-soup-contains("{inText}"), {tag}:has(*[{attr}*="{inAttr}"])'
auction_results = soup.select(selector)
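To check what the selector returns, here is a minimal self-contained sketch. The HTML snippet, the row ids, and the snake_case variable names are made up for illustration and are not from the question's page; it assumes soupsieve 2.1 or later is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the ids exist only to make the result easy to read.
html = """
<table>
  <tr id="r1"><td><a href="/2023-2024-base-residual-auction-report.ashx">Report</a></td></tr>
  <tr id="r2"><td><strong>3rd Incremental Auction</strong>
      <a href="/third-incremental-auction-report.ashx">Report</a></td></tr>
  <tr id="r3"><td><a href="/planning-parameters.ashx">Other</a></td></tr>
  <tr id="r4"><td>3rd Incremental Auction
      <a href="/base-residual-auction-results.ashx">Both</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

in_text, in_attr = "3rd Incremental Auction", "base-residual"
tag, attr = "tr", "href"

# The comma acts as OR: rows containing the text, or rows with a matching href inside.
selector = f'{tag}:-soup-contains("{in_text}"), {tag}:has(*[{attr}*="{in_attr}"])'

rows = soup.select(selector)
matched_ids = [row["id"] for row in rows]
print(matched_ids)
```

Row r4 satisfies both halves of the selector but still appears only once, so the comma-separated selector already gives the non-overlapping list the question asks for.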

It's pretty much the same as

auction_results = soup.find_all(lambda t: t.name == tag and ((
    t.find(lambda c: c.get(attr) and inAttr in c.get(attr)) is not None
) or (t.text and inText in t.text)))

Both versions are case-sensitive. With find_all, you can make the match case-insensitive by using re.search(inText, t.text, re.IGNORECASE | re.DOTALL) instead of inText in t.text (and by replacing inAttr in c.get(attr) in the same way).
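A rough sketch of that case-insensitive variant, assuming a small made-up table (the markup, ids, and the helper name row_matches are illustrative only). It precompiles both patterns and passes the attribute pattern to find through attrs, which Beautiful Soup applies as a regex search on the attribute value.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with different casing than the search patterns.
html = """
<table>
  <tr id="r1"><td><a href="/2023-2024-BASE-RESIDUAL-report.ashx">Report</a></td></tr>
  <tr id="r2"><td><strong>3RD incremental auction</strong></td></tr>
  <tr id="r3"><td><a href="/planning-parameters.ashx">Other</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tag_name, attr_name = "tr", "href"
text_re = re.compile("3rd Incremental Auction", re.IGNORECASE | re.DOTALL)
attr_re = re.compile("base-residual", re.IGNORECASE | re.DOTALL)

def row_matches(t):
    """True for a <tr> whose text or any descendant's href matches."""
    if t.name != tag_name:
        return False
    in_text = bool(text_re.search(t.get_text()))
    # find() returns the first descendant whose attribute value matches the pattern.
    in_attr = t.find(attrs={attr_name: attr_re}) is not None
    return in_text or in_attr

rows = soup.find_all(row_matches)
matched_ids = [row["id"] for row in rows]
print(matched_ids)
```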

Nesting is not a problem in this particular case, but if tag is something that is commonly nested, like div (or sometimes tr in nested tables), parent tags with the same name will also be included in auction_results. You can filter those out with

auction_results = [
    ar for ar in auction_results if not 
    [c for c in auction_results if ar in c.parents]
]
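The effect of that filter can be seen on a small nested table. This is only a sketch: the markup and ids are invented, and the helper is_ancestor_of_any is a hypothetical variant that compares by identity (is) rather than with in, because Tag comparison with == looks at name, attributes, and contents, so two distinct rows with identical markup would compare equal.

```python
from bs4 import BeautifulSoup

# Hypothetical nested tables: the outer row contains the inner row.
html = """
<table>
  <tr id="outer"><td>
    <table>
      <tr id="inner"><td><a href="/base-residual-report.ashx">Report</a></td></tr>
    </table>
  </td></tr>
  <tr id="plain"><td><a href="/base-residual-results.ashx">Results</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Both the outer and the inner row have a matching link somewhere below them.
matches = soup.select('tr:has(*[href*="base-residual"])')
all_ids = [m["id"] for m in matches]

def is_ancestor_of_any(candidate, others):
    """True if candidate is (by identity) a parent of any tag in others."""
    return any(candidate is parent for other in others for parent in other.parents)

# Keep only the innermost matches.
innermost = [m for m in matches if not is_ancestor_of_any(m, matches)]
innermost_ids = [m["id"] for m in innermost]
print(all_ids, innermost_ids)
```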

Alternatively, you can put it all in a function so that parent tags are filtered out in the same call, with something like soup.find_all(lambda t: find_results(t, inText, inAttr))

def checkPat(pat, main): # for convenience
    return bool(re.search(pat, main, re.IGNORECASE | re.DOTALL))

def find_results(tag, textPat, attrPat, tagName='tr', attrName='href'):     
    if tag.name != tagName: return False  
       
    hasText = checkPat(textPat, tag.text) if tag.text else False
    hasAttr = tag.find(lambda t: t.get(attrName) and checkPat(
        attrPat, t.get(attrName))) is not None
    
    if hasText or hasAttr:
        if not tag.find(lambda t: find_results(t,textPat,attrPat,tagName,attrName)): 
            return True
    return False

Or, if you will always use the same parameters and want to call soup.find_all(find_auction_results) directly, without a lambda

# def checkPat... # as before

def find_auction_results(tag):  
    textPat = "3rd Incremental Auction" 
    attrPat = "base-residual"
    tagName, attrName = "tr", "href"
    
    if tag.name != tagName: return False # has to be tr
       
    hasText = checkPat(textPat, tag.text) if tag.text else False
    hasAttr = tag.find(lambda t: t.get(attrName) and checkPat(
        attrPat, t.get(attrName))) is not None
    
    if hasText or hasAttr:
        # if nested like [ tr -> tr -> a ] , will only allow inner tr
        if not tag.find(find_auction_results): return True
    return False
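For the two-call route mentioned in the question, one possible way to join the results without duplicates is to track tags by id(). The sketch below is an assumption-laden illustration, not part of the approaches above: the markup, the ids, and the helper merge_unique are invented, and the href hits are mapped to their enclosing tr so that both calls yield rows.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: r1 has two matching links, r2 matches by text and by href.
html = """
<table>
  <tr id="r1"><td><a href="/base-residual-report.ashx">Report</a>
                  <a href="/base-residual-results.ashx">Results</a></td></tr>
  <tr id="r2"><td><strong>3rd Incremental Auction</strong>
                  <a href="/base-residual-extra.ashx">Extra</a></td></tr>
  <tr id="r3"><td><a href="/planning-parameters.ashx">Other</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Call 1: links whose href matches, each mapped to its enclosing row (may repeat rows).
links = soup.find_all(href=re.compile("base-residual", re.IGNORECASE))
by_href = [a.find_parent("tr") for a in links]

# Call 2: rows whose text matches.
text_re = re.compile("3rd Incremental Auction", re.IGNORECASE | re.DOTALL)
by_text = soup.find_all(lambda t: t.name == "tr" and bool(text_re.search(t.get_text())))

def merge_unique(*tag_lists):
    """Concatenate tag lists, keeping each tag object once (first occurrence wins)."""
    seen, merged = set(), []
    for tags in tag_lists:
        for t in tags:
            if t is not None and id(t) not in seen:
                seen.add(id(t))
                merged.append(t)
    return merged

rows = merge_unique(by_href, by_text)
merged_ids = [row["id"] for row in rows]
print(merged_ids)
```

The merged list follows the order of first appearance across the calls, which is not necessarily document order; the single-call versions above return rows in document order.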