Select a specific column and ignore the rest in BeautifulSoup Python (Avoiding nested tables)

Time:08-27

I'm trying to get only the first two columns of a table on a web page using BeautifulSoup in Python. The problem is that this table sometimes contains nested tables in the third column. The structure of the HTML is similar to this:

<table class="relative-table wrapped">
  <tbody>
    <tr>
      <td>
      </td>
      <td>
      </td>
      <td>
      </td>
    </tr>
    <tr>
      <td>
      </td>
      <td>
      </td>
      <td>
        <div >
          <table >
          ...
          ...
          </table>
        </div>
      </td>
    </tr>
  </tbody>
</table>

The main problem is that I don't know how to ignore every third td so that I don't read the nested tables inside the main one. I just want one list with the first column of the main table and another list with the second column of the main table, but the nested table breaks the pattern when I read the cells. I have tried this code:

import requests
from bs4 import BeautifulSoup

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
links = soup.select("table.relative-table tbody tr td.confluenceTd")
anchorList = []
for anchor in links:
    anchorList.append(anchor.text)

del anchorList[2:len(anchorList):3]
for anchorItem in anchorList:
    print(anchorItem)
    print('-------------------')

This works well until I reach the nested table, and then it starts deleting cells from other columns. I have also tried this other code:

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

for row in soup.findAll('table')[0].tbody.findAll('tr'):
    firstColumn = row.findAll('td')[0].contents
    secondColumn = row.findAll('td')[1].contents
    print(firstColumn, secondColumn)

But I get an IndexError because it's reading the nested table, and the nested table only has one td.

Does anyone know how I could read the first two columns and ignore the rest?

Thank you.

CodePudding user response:

Some more detailed examples would help to clarify, but as I understand it you are selecting the first <table> and trying to iterate over its rows:

soup.table.select('tr:not(:has(table))')

The selector above excludes every row that contains an additional <table>.
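To see what this selector actually returns, here is a small sketch on hypothetical sample markup (the cell texts a1, b1, nested and so on are made up for illustration). It suggests two things worth checking on the real page: a row that holds a nested table is dropped completely, including its first two cells, and rows inside the nested table can still match because they are also descendants of the outer table. Prefixing the selector with :scope > tbody > is one way to restrict it to the outer table's own rows, assuming the markup really contains a <tbody>.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row has a nested table in its third cell.
html = """
<table class="relative-table wrapped">
  <tbody>
    <tr><td>a1</td><td>b1</td><td>c1</td></tr>
    <tr>
      <td>a2</td><td>b2</td>
      <td><div><table><tbody><tr><td>nested</td></tr></tbody></table></div></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <tr> below the first table that does not contain a <table>.
loose = soup.table.select("tr:not(:has(table))")

# Same filter, limited to rows that are direct children of the outer <tbody>.
strict = soup.table.select(":scope > tbody > tr:not(:has(table))")

loose_text = [row.get_text(" ", strip=True) for row in loose]
strict_text = [row.get_text(" ", strip=True) for row in strict]

print(loose_text)   # ['a1 b1 c1', 'nested']
print(strict_text)  # ['a1 b1 c1']
```

In both cases the a2/b2 row is missing, so this selector only fits if rows with nested tables can be skipped altogether.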

An alternative would be to remove the last (third) <td> of each row:

for row in soup.table.select('tr'):
    # Remove the last <td> of the row, together with anything nested inside it
    row.select_one('td:last-of-type').decompose()

    # Or select it by its index instead:
    # row.select_one('td:nth-of-type(3)').decompose()

Now you can perform your selections on a <table> with two columns.

Example

from bs4 import BeautifulSoup
html ='''
<table class:"relative-table wrapped">
  <tbody>
    <tr>
      <td>
      </td>
      <td>
      </td>
      <td>
      </td>
    </tr>
    <tr>
      <td>
      </td>
      <td>
      </td>
      <td>
        <div >
          <table >
          ...
          ...
          </table>
        </div>
      </td>
    </tr>
  </tbody>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()

print(soup)

New soup

<table class:"relative-table="" wrapped"="">
<tbody>
<tr>
<td>
</td>
<td>
</td>

</tr>
<tr>
<td>
</td>
<td>
</td>

</tr>
</tbody>
</table>
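
If modifying the soup is not desirable, another option is to look only at direct children with recursive=False, so cells of nested tables are never visited. The following is a minimal sketch under assumptions, not part of the original answer: the sample markup and its cell texts are hypothetical, and it assumes the outer rows sit directly under a <tbody> (falling back to the <table> itself otherwise).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row has a nested table in its third cell.
html = """
<table class="relative-table wrapped">
  <tbody>
    <tr><td>a1</td><td>b1</td><td>c1</td></tr>
    <tr>
      <td>a2</td><td>b2</td>
      <td><div><table><tbody><tr><td>nested</td></tr></tbody></table></div></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Use the <tbody> if present; html.parser does not insert one automatically.
body = table.find("tbody", recursive=False) or table

first_column, second_column = [], []
for row in body.find_all("tr", recursive=False):
    # Direct <td> children only, so nested tables are never descended into.
    cells = row.find_all("td", recursive=False)
    if len(cells) < 2:
        continue  # skip rows that do not have two columns
    first_column.append(cells[0].get_text(strip=True))
    second_column.append(cells[1].get_text(strip=True))

print(first_column)   # ['a1', 'a2']
print(second_column)  # ['b1', 'b2']
```

Because only the first two direct cells of each outer row are read, the third column and any table nested in it are ignored without being deleted.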