Scraping Text Content of a Tag, But Not Text from Other Tags Inside It

Time:11-07

I need to get the text from the first <td> element of each <tr>, but not all of it: only the text inside <a> tags and the text that sits directly in the <td>, outside any other tag. In the example below, the text I want is marked "yyy"/"y" and the text I don't want is marked "zzz".

<table>
  <tbody>
    <tr>
      <td>
        <b>zzz</b>
        <a href="#">yyy</a>
        "y"
        <a href="#">yyy</a>
        <sup>zzz</sup>
        <a href="#">yyy</a>
        <a href="#">yyy</a>
        "y"
      </td>
      <td>
        zzzzz
      </td>
    </tr>
  </tbody>
</table>

Here is what I have at the moment:

words = []
for tableRows in soup.select("table > tbody > tr"):
  tableData = tableRows.find("td").text
  text = [word.strip() for word in tableData.split(' ')]
  words.append(text)
print(words)

But this code picks up all the text from the <td>, including the parts I don't want: ["zzz", "yyyy", "yyyy", "zzz", "yyyy"].

CodePudding user response:

Try:

from bs4 import BeautifulSoup, Tag, NavigableString

html_doc = """\
<table>
  <tbody>
    <tr>
      <td>
        <b>zzz</b>
        <a href="#">yyy</a>
        "y"
        <a href="#">yyy</a>
        <sup>zzz</sup>
        <a href="#">yyy</a>
        <a href="#">yyy</a>
        "y"
      </td>
      <td>
        zzzzz
      </td>
    </tr>
  </tbody>
</table>"""

soup = BeautifulSoup(html_doc, "html.parser")

for td in soup.select("td:nth-of-type(1)"):
    for c in td.contents:
        if isinstance(c, Tag) and c.name == "a":
            print(c.text.strip())
        elif isinstance(c, NavigableString):
            c = c.strip()
            if c:
                print(c)

Prints:

yyy
"y"
yyy
yyy
yyy
"y"

  • soup.select("td:nth-of-type(1)") selects only the first <td> of each row.
  • Then we iterate over the .contents of that <td>, i.e. its direct children.
  • isinstance(c, Tag) and c.name == "a" checks that the child is a Tag named "a".
  • isinstance(c, NavigableString) checks that the child is a plain string (a text node).
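If you want the result collected in a list per row (like the words list in the question) rather than printed, the same checks can be wrapped in a small helper. This is only a sketch: the function name direct_text_and_links and the shortened sample HTML are made up for illustration, and skipping HTML comments is an extra assumption that was not part of the original answer.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag


def direct_text_and_links(td):
    """Collect text of direct <a> children and direct text nodes, in document order."""
    parts = []
    for child in td.contents:
        if isinstance(child, Tag) and child.name == "a":
            text = child.get_text(strip=True)
        elif isinstance(child, NavigableString) and not isinstance(child, Comment):
            # Comment is a NavigableString subclass, so it is excluded explicitly
            text = child.strip()
        else:
            # any other tag (<b>, <sup>, ...) is ignored
            continue
        if text:
            parts.append(text)
    return parts


# Shortened version of the table from the question (illustrative only)
html_doc = """
<table><tbody><tr>
  <td><b>zzz</b><a href="#">yyy</a> "y" <sup>zzz</sup><a href="#">yyy</a></td>
  <td>zzzzz</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# tr.td is the first <td> in the row; rows without a <td> are skipped
words = [direct_text_and_links(tr.td) for tr in soup.select("table > tbody > tr") if tr.td]
print(words)  # [['yyy', '"y"', 'yyy']]
```

Each inner list holds the wanted text of one row, so you can flatten it later if you prefer a single list.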

CodePudding user response:

Based on your example, iterate over the children of the <td> tag. For each child, check whether its name is a or None (text nodes have no tag name). If the child has non-empty text, append it.

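words = []

# Assumes `soup` is the BeautifulSoup object built from the HTML above.
for row in soup.select("table > tbody > tr"):
    # row.td is the first <td> in the row; .children yields its direct children
    for child in row.td.children:
        # Text nodes have name None; keep them and direct <a> tags only
        if child.name == "a" or child.name is None:
            text = child.text.strip()
            if text:
                words.append(text)
print(words)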

Output:

['yyy', '"y"', 'yyy', 'yyy', 'yyy', '"y"']
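A different way to get the same list, not taken from either answer above, is to delete the unwanted direct child tags first and then read whatever text is left. The sketch below assumes you don't need the removed tags afterwards, because decompose() modifies the parsed tree in place; the sample HTML is a shortened stand-in for the table in the question.

```python
from bs4 import BeautifulSoup

# Shortened version of the table from the question (illustrative only)
html_doc = """
<table><tbody><tr>
  <td><b>zzz</b><a href="#">yyy</a> "y" <sup>zzz</sup><a href="#">yyy</a></td>
  <td>zzzzz</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

words = []
for tr in soup.select("table > tbody > tr"):
    td = tr.td  # first <td> of the row
    if td is None:
        continue
    # find_all(True, recursive=False) returns only the direct child tags
    for tag in td.find_all(True, recursive=False):
        if tag.name != "a":
            tag.decompose()  # removes the tag and its text from the tree
    # what remains is the <a> text plus the loose text nodes
    words.extend(td.stripped_strings)

print(words)  # ['yyy', '"y"', 'yyy']
```

If you still need the original tree later, parse the HTML a second time or use one of the read-only approaches above instead.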