Odd type error warning when using bs4 to obtain value from website

The following is a snippet from a website, from which I am trying to obtain only the "Text to Capture". That text sits inside a couple of nested div elements, which also contain tables, text, etc.

<div class="rankbox">
    <div>Ranking 
        <div > ... </div>
        <div > ... </div>
        **Text to Capture**
        <span > of 5</span>
        <span >&nbsp;</span>
        <span >&nbsp;</span>
        <span >3</span> 
        <span >&nbsp;</span>
        <span >&nbsp;</span>
    </div>
</div>

The oddity here is that the text to capture is not wrapped in any tag of its own. I have gotten this to work:

rankbox = soup.find('div', attrs={'class': 'rankbox'})
lx = [x for x in list(rankbox.contents[1])]
returnvalue = str(lx[4]).strip()

However, I am getting a type warning from PyCharm:

Expected type 'Iterable[_T]' (matched generic type 'Iterable[_T]'), got 'PageElement' instead

This is because rankbox.contents[1] is typed as a PageElement, not a list.

I am wondering whether there is a more elegant way of achieving this that also avoids the warning.
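To make the cause of the warning concrete: the elements of `contents` are declared as `PageElement`, which is not declared iterable, while `Tag` is. A minimal sketch of narrowing the type with `isinstance` before iterating follows. It assumes the outer div carries `class="rankbox"` (as the `find` call implies) and uses a trimmed copy of the snippet; whether a particular IDE then drops the warning is not guaranteed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the snippet; class="rankbox" on the outer div is an assumption.
html = """
<div class="rankbox">
    <div>Ranking
        <div> ... </div>
        <div> ... </div>
        **Text to Capture**
        <span> of 5</span>
        <span>3</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rankbox = soup.find("div", attrs={"class": "rankbox"})

returnvalue = None
if isinstance(rankbox, Tag):
    inner = rankbox.contents[1]  # declared as PageElement
    if isinstance(inner, Tag):   # narrowed to Tag, which is iterable
        lx = list(inner.children)
        returnvalue = str(lx[4]).strip()

print(returnvalue)
```

The index 4 still depends on the exact whitespace layout of the page, so this only addresses the typing, not the fragility of the lookup.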

CodePudding user response：

Given this HTML source, the following is one possible solution I could think of.

The idea is:

1. Get the first div tag under div.rankbox
2. Remove all div and span tags
3. Obtain text from the remaining source
4. Remove the text "Ranking" at the beginning
5. Remove surrounding spaces

import re
from bs4 import BeautifulSoup

html = """
<div class="rankbox">
    <div>Ranking 
        <div > ... </div>
        <div > ... </div>
        **Text to Capture**
        <span > of 5</span>
        <span >&nbsp;</span>
        <span >&nbsp;</span>
        <span >3</span> 
        <span >&nbsp;</span>
        <span >&nbsp;</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning

x = soup.select("div.rankbox div")[0]  # div starting with Ranking
# remove all divs and spans
for d in x.find_all("div"):
    d.extract()
for s in x.find_all("span"):
    s.extract()
x = x.text
x = re.sub(r"^Ranking", "", x)  # remove the leading "Ranking"
x = x.strip()

x
# '**Text to Capture**'
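One side effect of the steps above is that `extract()` removes the divs and spans from `soup` itself. If the rest of the tree is still needed, the same idea can be applied to a copy. The sketch below is one way to do that, not the answer's own code: the helper name `direct_text`, its `prefix` parameter, and the `class="rankbox"` attribute on the outer div are assumptions made for illustration.

```python
import copy
import re

from bs4 import BeautifulSoup

# Trimmed copy of the snippet; class="rankbox" on the outer div is an assumption.
html = """
<div class="rankbox">
    <div>Ranking
        <div> ... </div>
        <div> ... </div>
        **Text to Capture**
        <span> of 5</span>
        <span>3</span>
    </div>
</div>
"""


def direct_text(tag, prefix="Ranking"):
    """Return the tag's own text without nested div/span content (hypothetical helper)."""
    clone = copy.copy(tag)  # detached copy, so the original tree is left untouched
    for child in clone.find_all(["div", "span"]):
        child.extract()
    text = clone.get_text()
    # Drop the leading label, then the surrounding whitespace.
    text = re.sub(rf"^\s*{re.escape(prefix)}", "", text)
    return text.strip()


soup = BeautifulSoup(html, "html.parser")
inner = soup.select("div.rankbox div")[0]
result = direct_text(inner)
print(result)
```

After the call, `soup` still contains its spans and nested divs, because only the copy was modified.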

CodePudding user response：

The previous answer helped me find the shortest code for this:

xtract = soup.find('div', attrs={'class': 'zr_rankbox'})
x = xtract.select('div')[0].find_all(text=True, recursive=False)[1].get_text(strip=True)

This runs without the type error warning.
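A fixed index such as `[1]` depends on how many whitespace-only strings sit between the child tags, which can differ between the live page and a pasted snippet. A slightly more defensive sketch of the same approach filters out empty strings first. It uses `string=True`, the current name for the argument passed as `text=True` above (renamed in Beautiful Soup 4.4.0). The `class="rankbox"` attribute and the assumption that "Ranking" is the first non-empty direct string are illustrative, not confirmed by the original page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the snippet; class="rankbox" on the outer div is an assumption.
html = """
<div class="rankbox">
    <div>Ranking
        <div> ... </div>
        <div> ... </div>
        **Text to Capture**
        <span> of 5</span>
        <span>3</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
inner = soup.select("div.rankbox div")[0]

# Direct text children only; recursive=False skips text inside nested tags.
strings = inner.find_all(string=True, recursive=False)

# Drop whitespace-only entries so the index no longer depends on page layout.
candidates = [s.strip() for s in strings if s.strip()]

# Assumption: the first non-empty string is the "Ranking" label,
# and the second is the text to capture.
value = candidates[1]
print(value)
```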