Odd type error warning when using bs4 to obtain value from website

Time:03-27

The following is a snippet from a website, from which I am trying to obtain only the "Text to Capture". That text sits inside a couple of nested "div" elements, which also contain tables, other text, etc.

<div class="rankbox">
    <div>Ranking 
        <div > ... </div>
        <div > ... </div>
        **Text to Capture**
        <span > of 5</span>
        <span >&nbsp;</span>
        <span >&nbsp;</span>
        <span >3</span> 
        <span >&nbsp;</span>
        <span >&nbsp;</span>
    </div>
</div>

The oddity here is that the text to capture is not wrapped in any tag of its own. I have gotten this to work:

rankbox = soup.find('div', attrs={'class': 'rankbox'})
lx = [x for x in list(rankbox.contents[1])]
returnvalue = str(lx[4]).strip()

However, I am getting a type warning from PyCharm: Expected type 'Iterable[_T]' (matched generic type 'Iterable[_T]'), got 'PageElement' instead. This is because rankbox.contents[1] is typed as a PageElement, not a list.

I am wondering whether there is a more elegant way of achieving this that also avoids the warning.

CodePudding user response:

Given this HTML source, the following is one possible solution.

The idea is:

  1. Get the first div tag under div.rankbox
  2. Remove all div and span tags
  3. Obtain text from the remaining source
  4. Remove the text "Ranking" at the beginning
  5. Remove surrounding spaces

import re
from bs4 import BeautifulSoup

html = """
<div class="rankbox">
    <div>Ranking 
        <div > ... </div>
        <div > ... </div>
        **Text to Capture**
        <span > of 5</span>
        <span >&nbsp;</span>
        <span >&nbsp;</span>
        <span >3</span> 
        <span >&nbsp;</span>
        <span >&nbsp;</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning

x = soup.select("div.rankbox div")[0]  # div starting with Ranking
# remove all divs and spans
for d in x.find_all("div"):
    d.extract()
for s in x.find_all("span"):
    s.extract()
x = x.text
x = re.sub(r"^Ranking", "", x)  # remove the leading "Ranking"
x = x.strip()

x
# '**Text to Capture**'

CodePudding user response:

The previous answer helped me find the shortest code for this:

xtract = soup.find('div', attrs={'class': 'zr_rankbox'})
x = xtract.select('div')[0].find_all(string=True, recursive=False)[1].get_text(strip=True)

This does not trigger the type error warning.
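
For anyone who would rather not depend on a fixed index into the text nodes, here is a minimal sketch (not taken from either answer) that collects only the non-empty strings sitting directly under the inner div. It assumes the outer div carries class="rankbox", as in the question's code; the helper name direct_texts is made up for illustration. Note that in the snippet as posted, the whitespace between the two nested divs is a text node of its own, so an index that works on the live page may point at an empty string here. The isinstance check narrows the type, which is what usually keeps a type checker quiet.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Sample markup from the question; class="rankbox" on the outer div is assumed.
HTML = """
<div class="rankbox">
    <div>Ranking
        <div> ... </div>
        <div> ... </div>
        **Text to Capture**
        <span> of 5</span>
        <span>&nbsp;</span>
        <span>3</span>
    </div>
</div>
"""


def direct_texts(tag: Tag) -> list[str]:
    """Return the non-empty text nodes that are direct children of `tag`."""
    texts = []
    for child in tag.children:
        # Only bare strings count; content nested in <div>/<span> is skipped.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                texts.append(stripped)
    return texts


soup = BeautifulSoup(HTML, "html.parser")
inner = soup.select_one("div.rankbox > div")
# select_one() may return None, so narrow the type before using the result.
assert isinstance(inner, Tag)

texts = direct_texts(inner)
print(texts[1])  # **Text to Capture**
```

The first entry is the "Ranking" label and the second is the untagged text, so whitespace-only nodes no longer shift the index. Unlike the extract() approach, this leaves the parsed tree untouched.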
