Using BeautifulSoup to parse html, I am getting unwanted prints. Why is that?

Time:01-19

I am using Beautiful Soup to parse an HTML document in a Jupyter Notebook. Below is a sample from the file; the same HTML structure is repeated multiple times. The table tags below are siblings and are surrounded by other tags.

<table  width="100%" cellspacing="0" cellpadding="0" border="0">
   <tbody>
      <tr>
         <td colspan="2" width="100%" valign="top" bgcolor="#f0f0f0">
            <h3 > Title <a href="somelink">Title</a>
               <span > Date: 21/Dec/22 </span>
            </h3>
         </td>
      </tr>
      <tr>
         <td width="20%"><b>Status</b></td>
         <td width="80%">shipping</td>
      </tr>
   </tbody>
</table>
      
<table  width="100%" cellspacing="0" cellpadding="0" border="0">
   <tbody>
      <tr>
         <td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data</b></td>
         <td width="30%" valign="top" bgcolor="#ffffff"> some data </td>
         <td bgcolor="#f0f0f0"> <b>some data:</b>some data</td>
         <td valign="top" nowrap="" bgcolor="#ffffff">vsome data </td>
      </tr>
      <tr>
         <td width="20%" valign="top" bgcolor="#f0f0f0"> <b>some data:</b> </td>
      </tr>
   </tbody>
</table>

<table  width="100%" cellspacing="0" cellpadding="0" border="0">
   <tbody>
      <tr>
         <td width="20%" valign="top" bgcolor="#f0f0f0">
            <b>Sections</b>
         </td>
         <td  valign="top" bgcolor="#ffffff">
            <table  width="100%" cellspacing="0" cellpadding="0" border="0">
               <tbody>
                  <tr>
                     <td colspan="4" bgcolor="#f0f0f0"> <b>Section 1</b> </td>
                  </tr>
                  <tr>
                     <td> Test 1 </td>
                     <td> <a href="somelink"> Test 1 Code </a> </td>
                     <td> Test 1 Description </td>
                     <td> Test 1 Extended Description </td>
                  </tr>
                  <tr>
                     <td colspan="4" bgcolor="#f0f0f0"> <b>Section 2</b> </td>
                  </tr>
                  <tr>
                     <td> Test 2 </td>
                     <td> <a href="somelink"> Test 2 Code </a> </td>
                     <td> Test 2 Description </td>
                     <td> Test 2 Extended Description </td>
                  </tr>
                  <tr>
                     <td> Test 3 </td>
                     <td> <a href="somelink"> Test 3 Code </a> </td>
                     <td> Test 3 Description </td>
                     <td> Test 3 Extended Description </td>
                  </tr>
               </tbody>
            </table>
         </td>
      </tr>
   </tbody>
</table>

I have the following Python code, which prints unwanted results (duplicates) when I run it. I am not sure what I am doing wrong.

mainHtml = soup.find_all('table', class_='tableBorder')

for main in mainHtml:
    print()
    print("URL : ", main.tbody.tr.td.h3.a["href"])
    print("Title : ", main.tbody.tr.td.h3.a.text)
    print("Status : ", main.tbody.select('tr')[1].select('td')[1].text)

    linked = main.find_next_sibling('table', class_='grid')
    if linked:
        linked = linked.find_next_sibling('table', class_='grid')

    if linked:
        rows = linked.find_all('tr')

        # Iterate through the rows and extract the information
        for row in rows:
            cells = row.find_all('td')

            if len(cells) >= 4:
                # Extract the information from the cells
                a = cells[0].text.strip()
                b = cells[1].text.strip()
                c = cells[2].text.strip()
                d = cells[3].text.strip()

                print(a, b, c, d)

The output, including the unwanted prints, is the following:

Test 1 
Test 1 Code 
Test 1 Description 
Test 1 Extended Description

Test 2 
Test 2 Code 
Test 2 Description 
Test 2 Extended Description

Test 3 
Test 3 Code 
Test 3 Description 
Test 3 Extended Description

Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description

Since I have only one print statement at the end, I expect only the following format, but it appears after the unwanted prints. What could be causing this, and how can I fix it?

Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description

Answer:

My take on the problem would be to search "backwards": find the table with the descriptions, then search backwards for the URL, title, and status:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')  # html_doc contains your HTML snippet from the question

# Select every table that contains a <b> whose text includes "Sections"
for table in soup.select('table:has(b:-soup-contains(Sections))'):
    # Walk backwards in the document to the nearest preceding header and status
    h3 = table.find_previous('h3')
    url = h3.a['href']
    title = h3.a.text
    status = table.find_previous(lambda tag: tag.name == 'b' and tag.text == 'Status').find_next('td').text

    print(url)
    print(title)
    print(status)

    print()

    # Skip rows that contain a colspan cell (section headers and the outer wrapper row)
    for row in table.select('tr:not(:has([colspan]))'):
        print(' '.join(td.text.strip() for td in row.select('td')))

Prints:

somelink
Title
shipping

Test 1 Test 1 Code Test 1 Description Test 1 Extended Description
Test 2 Test 2 Code Test 2 Description Test 2 Extended Description
Test 3 Test 3 Code Test 3 Description Test 3 Extended Description
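
As for why the original loop prints extra lines: judging from the sample, a likely cause is that find_all() searches recursively by default. Calling find_all('tr') on the outer "Sections" table also returns the outer row, and find_all('td') on that row counts the cells of the nested table as its own, so the len(cells) >= 4 check passes and the second cell's text is the whole nested table. The sketch below illustrates this on a trimmed-down copy of the sample; the data_rows helper and the shortened HTML are my own illustration, not part of the question.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the "Sections" table from the question (assumption:
# attributes and the third test row are dropped to keep the example short).
html_doc = """
<table>
  <tbody>
    <tr>
      <td><b>Sections</b></td>
      <td>
        <table>
          <tbody>
            <tr><td colspan="4"><b>Section 1</b></td></tr>
            <tr>
              <td> Test 1 </td>
              <td> <a href="somelink"> Test 1 Code </a> </td>
              <td> Test 1 Description </td>
              <td> Test 1 Extended Description </td>
            </tr>
            <tr>
              <td> Test 2 </td>
              <td> <a href="somelink"> Test 2 Code </a> </td>
              <td> Test 2 Description </td>
              <td> Test 2 Extended Description </td>
            </tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
outer = soup.find("table")

# The first <tr> in document order is the outer wrapper row.
outer_row = outer.find("tr")

# Default (recursive) search: cells of the nested table are included.
recursive_cells = outer_row.find_all("td")

# recursive=False: only the row's direct <td> children are returned.
direct_cells = outer_row.find_all("td", recursive=False)


def data_rows(table):
    """Hypothetical helper: cell texts of rows with at least four direct <td> children."""
    rows = []
    for row in table.find_all("tr"):
        cells = row.find_all("td", recursive=False)
        if len(cells) >= 4:
            rows.append([cell.get_text(strip=True) for cell in cells[:4]])
    return rows


print(len(recursive_cells), len(direct_cells))
for row in data_rows(outer):
    print(" ".join(row))
```

With recursive=False the outer row has only two cells and is skipped, so only the Test 1 and Test 2 rows are printed, one line each. In the original code, changing row.find_all('td') to row.find_all('td', recursive=False) should have the same effect, assuming the real document is nested the same way as the sample.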