How to Parse a Text Which is Outside of Tag

Time:11-07

I'm parsing text in which every word is wrapped in a link. The problem is that the punctuation marks are not part of the <a> tags; they sit outside the tags as bare text, so I don't know how to extract the punctuation marks as well.

<table>
  <tbody>
    <tr>
      <td>
        <a href="#">Lorem</a>
        ", "
        <a href="#">Ipsum</a>
        ": "
        <a href="#">dolor</a>
        "."
      </td>
      <td>...</td>
    </tr>
    <tr>
      <td>
        <a href="#">sit</a>
        "? '"
        <a href="#">amet</a>
        "' "
        <a href="#">consectetur</a>
        "..."
      </td>
      <td>...</td>
    </tr>
    <tr>
      <td>
        <a href="#">adipisicing</a>
        "-"
        <a href="#">elit</a>
        "; "
        <a href="#">Molestias</a>
        "!"
      </td>
      <td>...</td>
    </tr>
  </tbody>
</table>

Here's the parser:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRows in soup.select("table > tbody > tr"):
  for word in tableRows.find("td").select("a"):
    words.append(word.text)

print(words)

CodePudding user response:

The text between the <a> elements belongs to the parent <td> element itself: each punctuation mark is a text node that sits next to the links inside the cell.
You can read the text directly from the <td> element, as follows:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
  # Read the first <td> only; tableRow.text would also include the other cells.
  words.append(tableRow.find("td").text)

print(words)
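
To check this without a browser, here is a minimal sketch that parses a static copy of two rows from the question's markup instead of driver.page_source. The html string and the names first_cell_texts and sentences are made up for illustration, and the quotes that the browser inspector shows around text nodes are assumed not to be part of the page source.

```python
from bs4 import BeautifulSoup

# Static stand-in for driver.page_source (sample rows adapted from the question).
html = """
<table><tbody>
<tr><td><a href="#">Lorem</a>, <a href="#">Ipsum</a>: <a href="#">dolor</a>.</td><td>...</td></tr>
<tr><td><a href="#">sit</a>? '<a href="#">amet</a>' <a href="#">consectetur</a>...</td><td>...</td></tr>
</tbody></table>
"""


def first_cell_texts(soup):
    """Return the full text of the first <td> in every table row."""
    texts = []
    for row in soup.select("table > tbody > tr"):
        cell = row.find("td")
        if cell is not None:
            # get_text() joins the link texts and the bare text nodes between them.
            texts.append(cell.get_text())
    return texts


soup = BeautifulSoup(html, "html.parser")
sentences = first_cell_texts(soup)
print(sentences)
```

For this sample the list holds one string per row, "Lorem, Ipsum: dolor." and "sit? 'amet' consectetur...", with the punctuation in place.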

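UPD
If you want the words as separate items, you can split the text on whitespace. Note that, for the markup shown in the question, each punctuation mark stays attached to the word before it (for example "Lorem,"); splitting alone does not turn the punctuation marks into items of their own. The following code splits the text and drops leading and trailing whitespace.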

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')

words = []
for tableRow in soup.select("table > tbody > tr"):
  # split() with no argument splits on any run of whitespace and never yields empty strings.
  rowTexts = tableRow.find("td").text.split()
  words.append(rowTexts)

print(words)
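
If the punctuation marks really are needed as separate items, one possible approach (a sketch, not part of the original answer) is to walk the text nodes of the cell instead of splitting a joined string. BeautifulSoup's stripped_strings generator yields each text node with surrounding whitespace removed and skips nodes that are empty after stripping. The html sample and the name tokens_per_row are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Static stand-in for driver.page_source (sample rows adapted from the question).
html = """<table><tbody>
<tr><td><a href="#">Lorem</a>, <a href="#">Ipsum</a>: <a href="#">dolor</a>.</td><td>...</td></tr>
<tr><td><a href="#">adipisicing</a>-<a href="#">elit</a>; <a href="#">Molestias</a>!</td><td>...</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

# One list per row: link texts and the punctuation between them, in document order.
tokens_per_row = [
    list(row.find("td").stripped_strings)
    for row in soup.select("table > tbody > tr")
]
print(tokens_per_row)
```

For this sample the first row becomes ["Lorem", ",", "Ipsum", ":", "dolor", "."]. Punctuation that shares a single text node, such as the ? ' between "sit" and "amet" in the question, would still come back as one item and would need a further split if each character is wanted on its own.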