I have a Selenium-based web crawler application that monitors over 100 different medical publications, with more being added regularly. Each of these publications has a different site structure, so I've tried to make the web crawler as general and re-usable as possible (especially because this is intended for use by other colleagues). For each crawler, the user specifies a list of regex URL patterns that the crawler is allowed to crawl. From there, the crawler will grab any links found as well as specified sections of the HTML. This has been useful in downloading large amounts of content in a fraction of the time it would take to do manually.
I'm now trying to figure out a way to generate custom reports based on the HTML of a certain page. For example, when crawling site X, export a JSON file that shows the number of issues on the page, the name of each issue, the number of articles under each issue, and the title and author names of each of those articles. The page I'll use as an example and test case is
CodePudding user response:
With BeautifulSoup, I strongly prefer using select and select_one over find_all and find when scraping nested elements. (If you're not used to working with CSS selectors, I find the w3schools reference page to be a good cheatsheet for them.)
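For instance, here is a quick sketch (with a made-up HTML snippet, not the real page) of how a single CSS selector can replace a pair of nested find_all calls:

from bs4 import BeautifulSoup

demoHtml = '<div class="section__inner"><article><h3 class="teaser__title">Some title</h3></article></div>'
demoSoup = BeautifulSoup(demoHtml, 'html.parser')

# nested find_all calls...
titles_nested = [h3 for art in demoSoup.find_all('article')
                 for h3 in art.find_all('h3', class_='teaser__title')]
# ...vs a single CSS selector doing the same thing
titles_css = demoSoup.select('article h3.teaser__title')
print([t.get_text(strip=True) for t in titles_css])  # ['Some title']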
If you defined a function like
def getSoupData(mSoup, dataStruct, maxDepth=None, curDepth=0):
    if type(dataStruct) != dict:
        # so selector/targetAttr can also be sent as a single string
        if str(dataStruct).startswith('"ta":'):
            dKey = 'targetAttr'
        else:
            dKey = 'cssSelector'
        dataStruct = str(dataStruct).replace('"ta":', '', 1)
        dataStruct = {dKey: dataStruct}

    # default values: isList=False, items={}
    isList = dataStruct['isList'] if 'isList' in dataStruct else False
    if 'items' in dataStruct and type(dataStruct['items']) == dict:
        items = dataStruct['items']
    else: items = {}

    # no selector -> just use the input directly
    if 'cssSelector' not in dataStruct:
        soup = mSoup if type(mSoup) == list else [mSoup]
    else:
        soup = mSoup.select(dataStruct['cssSelector'])

    # so that unneeded parts are not processed:
    if not isList: soup = soup[:1]

    # return empty if nothing was selected
    if not soup: return [] if isList else None

    # return text or attribute values - no more recursion
    if items == {}:
        if 'targetAttr' in dataStruct:
            targetAttr = dataStruct['targetAttr']
        else:
            targetAttr = '"text"'  # default
        if targetAttr == '"text"':
            sData = [s.get_text(strip=True) for s in soup]
        # can put in more options with elif
        else:
            sData = [s.get(targetAttr) for s in soup]
        return sData if isList else sData[0]

    # return error - recursion limited
    if maxDepth is not None and curDepth > maxDepth:
        return {
            'errorMsg': f'Maximum depth [{maxDepth}] exceeded at depth={curDepth}'
        }

    # recursively get items
    sData = [dict([(i, getSoupData(
        s, items[i], maxDepth, curDepth + 1
    )) for i in items]) for s in soup]
    # return a list only if isList is set
    return sData if isList else sData[0]
you can make your data structure as nested as your HTML structure [because the function is recursive], if you want that for some reason; but you can also set maxDepth to limit how nested it can get. If you don't want to set any limits, you can get rid of both maxDepth and curDepth, as well as any parts involving them.
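As a quick, self-contained check of the function (using a made-up HTML snippet rather than the real page), nested items come back as nested dicts and lists:

from bs4 import BeautifulSoup

demoHtml = '''
<div class="issue">
  <p class="label">Issue 1/2023</p>
  <article><h3>First title</h3><p class="authors">A. Author</p></article>
  <article><h3>Second title</h3><p class="authors">B. Author</p></article>
</div>'''
demoStruct = {
    'cssSelector': 'div.issue',
    'items': {
        'Issue': 'p.label',
        'Articles': {
            'cssSelector': 'article',
            'items': {'Title': 'h3', 'Author': 'p.authors'},
            'isList': True
        }
    },
    'isList': True
}
print(getSoupData(BeautifulSoup(demoHtml, 'html.parser'), demoStruct))
# [{'Issue': 'Issue 1/2023', 'Articles': [{'Title': 'First title', 'Author': 'A. Author'},
#                                         {'Title': 'Second title', 'Author': 'B. Author'}]}]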
Then, you can make your config file something like
{
    "name": "Paedriactia",
    "data_structure": {
        "cssSelector": "div.section__inner",
        "items": {
            "Issue": "p.section__spitzmarke",
            "Articles": {
                "cssSelector": "article",
                "items": {
                    "Title": "h3.teaser__title",
                    "Author": "p.teaser__authors"
                },
                "isList": true
            }
        },
        "isList": true
    },
    "url": "https://www.paediatrieschweiz.ch/zeitschriften/"
}
["isList": true
here is equivalent to your "find_all": true
; and your "subunits" are also defined as "items" here - the function can differentiate based on the structure/dataType.]
Now the same data that you showed [at the beginning of your question] can be extracted with
import json
import requests
from bs4 import BeautifulSoup

configC = json.load(open('crawlerConfig_paedriactia.json', 'r'))
url = configC['url']
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
dStruct = configC['data_structure']
getSoupData(soup, dStruct)
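If you want that result written out as a JSON report file (as described in the question), one way to do it, with a hypothetical output filename, would be:

import json

pageData = getSoupData(soup, dStruct)
with open('report_paedriactia.json', 'w', encoding='utf-8') as f:
    json.dump(pageData, f, ensure_ascii=False, indent=2)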
For this example, you could add the article links by adding {"cssSelector": "a.teaser__inner", "targetAttr": "href"} as ...Articles.items.Link.
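With that addition, the Articles section of the config would look something like:

"Articles": {
    "cssSelector": "article",
    "items": {
        "Title": "h3.teaser__title",
        "Author": "p.teaser__authors",
        "Link": {"cssSelector": "a.teaser__inner", "targetAttr": "href"}
    },
    "isList": true
}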
Also, note that [because of the defaults set at the beginning of the function] "Title": "h3.teaser__title" is the same as "Title": { "cssSelector": "h3.teaser__title" }, and "Link": "\"ta\":href" would be the same as "Link": {"targetAttr": "href"}.