The following is a snippet from a website from which I am trying to obtain only the "Text to Capture". That text is surrounded by a couple of "div" elements, which contain tables, text, etc.
<div >
<div>Ranking
<div > ... </div>
<div > ... </div>
**Text to Capture**
<span > of 5</span>
<span > </span>
<span > </span>
<span >3</span>
<span > </span>
<span > </span>
</div>
</div>
The oddity here is that the text to capture has no tags associated with it whatsoever. I have gotten this to work:
rankbox = soup.find('div', attrs={'class': 'rankbox'})
lx = [x for x in list(rankbox.contents[1])]
returnvalue = str(lx[4]).strip()
However, I am getting a type warning from PyCharm: Expected type 'Iterable[_T]' (matched generic type 'Iterable[_T]'), got 'PageElement' instead. This is because rankbox.contents[1] is a PageElement, not a list.
I am wondering whether there is a more elegant way of achieving this that also avoids the warning.
CodePudding user response:
Given this HTML source, the following is a possible solution.
The idea is:
- Get the first div tag under div.rankbox
- Remove all div and span tags
- Obtain the text from the remaining source
- Remove the text "Ranking" at the beginning
- Remove surrounding spaces
import re
from bs4 import BeautifulSoup
html = """
<div class="rankbox">
<div>Ranking
<div > ... </div>
<div > ... </div>
**Text to Capture**
<span > of 5</span>
<span > </span>
<span > </span>
<span >3</span>
<span > </span>
<span > </span>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
x = soup.select("div.rankbox div")[0]  # div starting with "Ranking"
# remove all nested divs and spans
for d in x.find_all("div"):
    d.extract()
for s in x.find_all("span"):
    s.extract()
x = x.text
x = re.sub(r"^Ranking", "", x)  # remove the leading "Ranking"
x = x.strip()
x
# '**Text to Capture**'
CodePudding user response:
The previous answer helped me find the shortest code for this:
xtract = soup.find('div', attrs={'class': 'zr_rankbox'})
x = xtract.select('div')[0].find_all(text=True, recursive=False)[1].get_text(strip=True)
This runs without the type error warning.
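A positional index such as [1] depends on how many whitespace-only text nodes sit between the tags, so it may differ between the real page and a pasted snippet. The following is only a sketch of a variant that skips empty text nodes and narrows types with isinstance, which is one way a type checker can be kept quiet. The helper name direct_texts, the trimmed sample markup and the class name "rankbox" (taken from the question) are assumptions for illustration.

```python
from typing import List

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Trimmed sample markup; class "rankbox" is assumed from the question.
html = """
<div class="rankbox">
<div>Ranking
<div> ... </div>
<div> ... </div>
**Text to Capture**
<span> of 5</span>
<span>3</span>
</div>
</div>
"""


def direct_texts(tag: Tag) -> List[str]:
    """Return the non-empty text nodes that are direct children of tag."""
    texts = []
    for child in tag.children:
        # Only bare strings count; nested tags (div, span) are skipped.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                texts.append(stripped)
    return texts


soup = BeautifulSoup(html, "html.parser")
rankbox = soup.find("div", attrs={"class": "rankbox"})
# isinstance narrows the type before any Tag-only method is called.
inner = rankbox.find("div") if isinstance(rankbox, Tag) else None
texts = direct_texts(inner) if isinstance(inner, Tag) else []
# texts[0] is the "Ranking" label; texts[1] is the untagged text.
captured = texts[1] if len(texts) > 1 else None
print(captured)
```

Unlike the extract() approach, this leaves the parsed tree unchanged, so the soup can still be queried afterwards.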