I'm parsing a text in which every word is wrapped in a link. The problem is that the punctuation marks aren't part of the content of the <a> tags; they sit outside the tags as plain text, so I don't know how to get the punctuation marks too.
<table>
<tbody>
<tr>
<td>
<a href="#">Lorem</a>
", "
<a href="#">Ipsum</a>
": "
<a href="#">dolor</a>
"."
</td>
<td>...</td>
</tr>
<tr>
<td>
<a href="#">sit</a>
"? '"
<a href="#">amet</a>
"' "
<a href="#">consectetur</a>
"..."
</td>
<td>...</td>
</tr>
<tr>
<td>
<a href="#">adipisicing</a>
"-"
<a href="#">elit</a>
"; "
<a href="#">Molestias</a>
"!"
</td>
<td>...</td>
</tr>
</tbody>
</table>
Here's the parser:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')
words = []
for tableRows in soup.select("table > tbody > tr"):
    for word in tableRows.find("td").select("a"):
        words.append(word.text)
print(words)
CodePudding user response:
The text between the a tags belongs to the parent td element itself.
You can grab the text directly from the td elements, as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')
words = []
for tableRow in soup.select("table > tbody > tr"):
    words.append(tableRow.text)
print(words)
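To make the point about the parent td concrete, here is a minimal sketch that parses an inline HTML string instead of driver.page_source, so it runs without Selenium. The html string and the first_cell_pieces helper are assumptions made up for illustration; the markup is only a guess at what the real page looks like. It uses Beautiful Soup's stripped_strings, which walks every text node under the cell in document order, so the link texts and the punctuation between them come out as separate items.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for driver.page_source.
html = """
<table><tbody>
<tr><td><a href="#">Lorem</a>, <a href="#">Ipsum</a>: <a href="#">dolor</a>.</td><td>...</td></tr>
<tr><td><a href="#">sit</a>? '<a href="#">amet</a>' <a href="#">consectetur</a>...</td><td>...</td></tr>
</tbody></table>
"""


def first_cell_pieces(markup):
    """Return, for each row, the text pieces of its first td in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for row in soup.select("table > tbody > tr"):
        cell = row.find("td")
        # stripped_strings yields each text node under the cell with
        # surrounding whitespace removed, skipping whitespace-only nodes.
        rows.append(list(cell.stripped_strings))
    return rows


pieces = first_cell_pieces(html)
print(pieces)
```

Note that a text node such as "? '" stays in one piece here, because it is a single node between two links; it would need further splitting if every punctuation mark has to be its own item.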
UPD
If you want the punctuation marks as separate items, you can split the table row text on spaces. The following code should do that and also strip leading and trailing whitespace from each piece.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(executable_path="...")
driver.get(url=...)
soup = BeautifulSoup(driver.page_source, 'html.parser')
words = []
for tableRow in soup.select("table > tbody > tr"):
    tableRowtext = tableRow.text
    rowTexts = [x.strip() for x in tableRowtext.split(' ')]
    words.append(rowTexts)
print(words)
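One caveat with splitting on spaces: a mark that directly follows a word with no space in between (as in "Lorem,") stays attached to that word. As a possible alternative, and not part of the approach above, here is a sketch that tokenizes the row text with a regular expression instead. The sample strings and the split_words_and_punctuation helper are hypothetical; the real tableRow.text may contain different whitespace.

```python
import re

# Hypothetical samples of what tableRow.text might return.
row_text = "\nLorem, Ipsum: dolor.\n"
other_row_text = "sit? 'amet' consectetur..."


def split_words_and_punctuation(text):
    """Split text into word tokens and punctuation tokens, dropping whitespace."""
    # \w+      matches a run of word characters (letters, digits, underscore)
    # [^\w\s]+ matches a run of characters that are neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]+", text)


tokens = split_words_and_punctuation(row_text)
other_tokens = split_words_and_punctuation(other_row_text)
print(tokens)
print(other_tokens)
```

Consecutive punctuation characters such as "..." are kept together as one token; replacing `[^\w\s]+` with `[^\w\s]` would emit each character separately.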