Scrape the td elements that are not inside a tr

Hi, I want to scrape the content of a table from a website using Python. The HTML of the table is shown below.

    <table  title="">   <tbody>
      <tr>
         <td colspan="7"><br/></td>
         <td style="text-align:center;"><strong>N/A*</strong></td>
      </tr>
      <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
      <td style="text-align:center;"><strong>N/A*</strong></td>
      <td colspan="7"><a href="persondetail.php?custnumbe">abc</a><br/></td>
      <td style="text-align:center;"><strong>N/A*</strong></td>
      <td colspan="7"><br/></td>
      <td style="text-align:center;"><strong>N/A*</strong></td>
      <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
      <td style="text-align:center;"><strong>N/A*</strong></td>
      <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
      <td style="text-align:center;"><strong>N/A*</strong></td>
   </tbody>
</table>

Below is the Python code which I'm using to scrape the above HTML.

table_data = soup.find('tbody')
for tr in table_data.find_all('tr'):
    row_data = tr.find_all('td')
    row = [td.text for td in row_data]
    thewriter.writerow(row)

When I run it, the result contains only the first two cells, because the other cells are not wrapped in a "tr".

CodePudding user response：

You could call find_all("td") directly, like this:

table_data = soup.find('tbody')
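# Collect every cell, then write them two per row (assumes two cells per row).
cells = [td.get_text(strip=True) for td in table_data.find_all('td')]
for i in range(0, len(cells), 2):
    thewriter.writerow(cells[i:i + 2])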

From the documentation: "The find_all() method looks through a tag's descendants and retrieves all descendants that match your filters."
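
As a self-contained illustration of this approach, here is a minimal sketch. It assumes a shortened copy of the table from the question, the built-in "html.parser" backend, and that every logical row is made of two consecutive cells. The names CELLS_PER_ROW and buffer are made up for the example, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the table from the question; only the first two
# cells are wrapped in a <tr>.
html_doc = """
<table title=""><tbody>
  <tr>
    <td colspan="7"><br/></td>
    <td style="text-align:center;"><strong>N/A*</strong></td>
  </tr>
  <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
  <td colspan="7"><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
</tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all() searches all descendants, so it sees every <td>,
# whether or not it sits inside a <tr>.
cells = [td.get_text(strip=True) for td in soup.find("tbody").find_all("td")]

# Assumption: every logical row consists of two consecutive cells.
CELLS_PER_ROW = 2
rows = [cells[i:i + CELLS_PER_ROW] for i in range(0, len(cells), CELLS_PER_ROW)]

# Write to an in-memory buffer here; a real script would pass an open file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

print(buffer.getvalue())
```

If the real table has a different number of cells per row, only CELLS_PER_ROW needs to change.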

CodePudding user response：

Except for the two inside the tr, all the td elements are direct children of tbody, so select every td under the table instead of going through tr:

from bs4 import BeautifulSoup

html_doc="""

<table  title="">   <tbody>
  <tr>
     <td colspan="7"><br/></td>
     <td style="text-align:center;"><strong>N/A*</strong></td>
  </tr>
  <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
  <td colspan="7"><a href="persondetail.php?custnumbe">abc</a><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
  <td colspan="7"><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
  <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
  <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>

"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Select every td under the table, whether or not it is inside a tr
tds = soup.select('table td')
for td in tds:
    print(td.text)

Output:

N/A*
abc
N/A*
abc
N/A*

N/A*
abc
N/A*
abc
N/A*
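
To check which cells actually sit inside a tr, one option is to look at each cell's parent. The sketch below assumes a shortened copy of the table and the "html.parser" backend, which keeps the markup as written; other parsers may repair the broken table differently, so the result can vary. The names parents and stray_cells are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table; only the first two cells are inside a <tr>.
html_doc = """
<table title=""><tbody>
  <tr>
    <td colspan="7"><br/></td>
    <td style="text-align:center;"><strong>N/A*</strong></td>
  </tr>
  <td colspan="7"><a href="persondetail.php?custnumber">abc</a><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
  <td colspan="7"><br/></td>
  <td style="text-align:center;"><strong>N/A*</strong></td>
</tbody></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
tbody = soup.find("tbody")

# Name of the element that directly contains each cell.
parents = [td.parent.name for td in tbody.find_all("td")]
print(parents)  # ['tr', 'tr', 'tbody', 'tbody', 'tbody', 'tbody']

# recursive=False limits the search to direct children of <tbody>,
# which returns only the cells that are missing a <tr>.
stray_cells = tbody.find_all("td", recursive=False)
print(len(stray_cells))  # 4
```

This is why find_all('tr') followed by find_all('td') misses most of the data: only the first two cells have a tr as their parent.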