I am using Beautiful Soup to parse an HTML document in a Jupyter Notebook. Below is a sample from the file; the same HTML structure is repeated multiple times. The table tags below are siblings and are surrounded by other tags.
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td colspan="2" width="100%" valign="top" bgcolor="#f0f0f0">
<h3 > Title <a href="somelink">Title</a>
<span > Date: 21/Dec/22 </span>
</h3>
</td>
</tr>
<tr>
<td width="20%"><b>Status</b></td>
<td width="80%">shipping</td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data</b></td>
<td width="30%" valign="top" bgcolor="#ffffff"> some data </td>
<td bgcolor="#f0f0f0"> <b>some data:</b>some data</td>
<td valign="top" nowrap="" bgcolor="#ffffff">vsome data </td>
</tr>
<tr>
<td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data:</b> </td>
</tr>
</tbody>
</table>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="20%" valign="top" bgcolor="#f0f0f0">
<b>Sections</b>
</td>
<td valign="top" bgcolor="#ffffff">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td colspan="4" bgcolor="#f0f0f0"> <b>Section 1</b> </td>
</tr>
<tr>
<td> Test 1 </td>
<td> <a href="somelink"> Test 1 Code </a> </td>
<td> Test 1 Description </td>
<td> Test 1 Extended Description </td>
</tr>
<tr>
<td colspan="4" bgcolor="#f0f0f0"> <b>Section 2</b> </td>
</tr>
<tr>
<td> Test 2 </td>
<td> <a href="somelink"> Test 2 Code </a> </td>
<td> Test 2 Description </td>
<td> Test 2 Extended Description </td>
</tr>
<tr>
<td> Test 3 </td>
<td> <a href="somelink"> Test 3 Code </a> </td>
<td> Test 3 Description </td>
<td> Test 3 Extended Description </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
I have the following Python code, which prints unwanted results (duplicates) when I run it. I am not sure what I am doing wrong:
mainHtml = soup.find_all('table', class_='tableBorder')
for main in mainHtml:
    print()
    print("URL : ", main.tbody.tr.td.h3.a["href"])
    print("Title : ", main.tbody.tr.td.h3.a.text)
    print("Status : ", main.tbody.select('tr')[1].select('td')[1].text)
    linked = main.find_next_sibling('table', class_='grid')
    if linked:
        linked = linked.find_next_sibling('table', class_='grid')
        if linked:
            rows = linked.find_all('tr')
            # Iterate through the rows and extract the information
            for row in rows:
                cells = row.find_all('td')
                if len(cells) >= 4:
                    # Extract the information from the cells
                    a = cells[0].text.strip()
                    b = cells[1].text.strip()
                    c = cells[2].text.strip()
                    d = cells[3].text.strip()
                    print(a, b, c, d)
The output, including the unwanted prints, is the following:
Test 1
Test 1 Code
Test 1 Description
Test 1 Extended Description
Test 2
Test 2 Code
Test 2 Description
Test 2 Extended Description
Test 3
Test 3 Code
Test 3 Description
Test 3 Extended Description
Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
Since I have only one print statement at the end, I would like to get only the following format. I do get it, but only after the unwanted prints. What causes this, and is there a way to fix it?
Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
CodePudding user response:
My take on the problem would be to search "backwards": find the table with the description first, then search backwards from it for the URL/Title/Status:
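from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc contains your HTML snippet from the question

# Select only the tables that contain a <b> with the text "Sections"
for table in soup.select('table:has(b:-soup-contains(Sections))'):
    # Walk backwards in the document to the nearest heading and status cell
    url = table.find_previous('h3').a['href']
    title = table.find_previous('h3').a.text
    status = table.find_previous(lambda tag: tag.name == 'b' and tag.text == 'Status').find_next('td').text
    print(url)
    print(title)
    print(status)
    print()
    # Skip rows that contain a cell with colspan (section headers and the outer wrapper row)
    for row in table.select('tr:not(:has([colspan]))'):
        print(' '.join(td.text.strip() for td in row.select('td')))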
Prints:
somelink
Title
shipping
Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
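As for what may cause the extra lines in the question: one plausible explanation (not confirmed against the full page) is that find_all() is recursive by default. The "Sections" table wraps a nested table, so the outer row is matched too, and its find_all('td') returns every nested cell; the second of those cells holds the text of the entire nested table, newlines included. The sketch below uses a trimmed copy of the sample HTML and a hypothetical helper, leaf_rows, that passes recursive=False so each row is judged only by its own direct cells.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the "Sections" table from the question (two test rows only).
html_doc = """
<table>
<tbody>
<tr>
<td><b>Sections</b></td>
<td>
<table>
<tbody>
<tr><td colspan="4"> <b>Section 1</b> </td></tr>
<tr><td> Test 1 </td><td> <a href="somelink"> Test 1 Code </a> </td><td> Test 1 Description </td><td> Test 1 Extended Description </td></tr>
<tr><td colspan="4"> <b>Section 2</b> </td></tr>
<tr><td> Test 2 </td><td> <a href="somelink"> Test 2 Code </a> </td><td> Test 2 Description </td><td> Test 2 Extended Description </td></tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
outer = soup.find("table")

# The outer row "sees" every nested cell because find_all() descends by default.
outer_row = outer.find("tr")
all_cells = outer_row.find_all("td")
print(len(all_cells))  # 12: 2 outer cells + 10 nested cells


def leaf_rows(table):
    """Hypothetical helper: cell texts of rows with exactly four direct <td> children."""
    result = []
    for row in table.find_all("tr"):
        # recursive=False restricts the search to the row's own children
        cells = row.find_all("td", recursive=False)
        if len(cells) == 4:
            result.append([cell.text.strip() for cell in cells])
    return result


for cells in leaf_rows(outer):
    print(" ".join(cells))
```

With recursive=False the outer wrapper row (two direct cells) and the section header rows (one direct cell) are skipped, so only the four-column test rows are printed, one line each.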