Scraping tables using beautiful soup but not displaying as desired

Time:01-11

I have been trying to scrape a table using a mixture of guides and ChatGPT.

The page I am fetching contains the following HTML:

<tr>


<td >

<a href="/vehicles/fgla-33101">33101</a>

</td>
<td><a href="/vehicles/fgla-33101">SK19 EOM</a></td>

<td >
500
</td>
<td >

7 Jan 23:30

</td>
<td>ADL Enviro400 City</td>
<td >

<div ></div>
Glasgow Airport Express

</td>




<td ></td>
<td >USB power</td>

<td><a href="https://www.flickr.com/search/?text=SK19EOM or "SK19 EOM" or First 33101&amp;sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a></td>

<td><a href="/vehicles/fgla-33101/edit">Edit</a></td>

</tr>
<tr>


<td >

<a href="/vehicles/fgla-33102">33102</a>

</td>
<td><a href="/vehicles/fgla-33102">SK19 EOO</a></td>

<td >
500
</td>
<td >

7 Jan 18:35

</td>
<td>ADL Enviro400 City</td>
<td >

<div ></div>
Glasgow Airport Express

</td>




<td ></td>
<td >USB power</td>

<td><a href="https://www.flickr.com/search/?text=SK19EOO or "SK19 EOO" or First 33102&amp;sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a></td>

<td><a href="/vehicles/fgla-33102/edit">Edit</a></td>

</tr>

Following these guides, this is what I've tried:

# Parse the HTML of the web page
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the buses on the page
buses = soup.find_all('table', class_='fleet compact')

print(buses)

# Loop through the buses and extract the information
for bus in buses:
  fleet_number = (bus.find('td', class_='number').text)
  registration = (bus.find('td', class_='number'))
  service_number = (bus.find('td', class_='last-seen').text)
  last_seen = (bus.find('td', class_='last-seen'))
  model = (bus.find('td').text)
  

  # Print the scraped information
  print(fleet_number)
  print(registration)
  print(service_number)
  print(last_seen)
  print(model)

But this only gives me

33101

<td >
<a href="/vehicles/fgla-33101">33101</a>
</td>

500

<td >
500
</td>

33101

But my expected output was

33101 
SK19 EOM 
500 
7 Jan 23:30 
ADL Enviro400 City

I'm unsure how to do it any other way currently. Would there be a way to make this work?

Answer:

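Your loop iterates over tables rather than rows, and find() returns only the first matching element, so each lookup keeps landing on the first row's cells. Instead, iterate over each <tr> and read its <td> cells by position. You can try: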

from bs4 import BeautifulSoup

html_doc = """\
<tr>
  <td >
    <a href="/vehicles/fgla-33101">33101</a>
  </td>
  <td>
    <a href="/vehicles/fgla-33101">SK19 EOM</a>
  </td>
  <td > 500 </td>
  <td > 7 Jan 23:30 </td>
  <td>ADL Enviro400 City</td>
  <td >
    <div ></div> Glasgow Airport Express
  </td>
  <td ></td>
  <td >USB power</td>
  <td>
    <a href="https://www.flickr.com/search/?text=SK19EOM or "SK19 EOM" or First 33101&amp;sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a>
  </td>
  <td>
    <a href="/vehicles/fgla-33101/edit">Edit</a>
  </td>
</tr>
<tr>
  <td >
    <a href="/vehicles/fgla-33102">33102</a>
  </td>
  <td>
    <a href="/vehicles/fgla-33102">SK19 EOO</a>
  </td>
  <td > 500 </td>
  <td > 7 Jan 18:35 </td>
  <td>ADL Enviro400 City</td>
  <td >
    <div ></div> Glasgow Airport Express
  </td>
  <td ></td>
  <td >USB power</td>
  <td>
    <a href="https://www.flickr.com/search/?text=SK19EOO or "SK19 EOO" or First 33102&amp;sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a>
  </td>
  <td>
    <a href="/vehicles/fgla-33102/edit">Edit</a>
  </td>
</tr>"""

soup = BeautifulSoup(html_doc, "html.parser")

for row in soup.select("tr"):
    tds = [td.text.strip() for td in row.select("td")]
    fleet_number, registration, service_number, last_seen, model = tds[:5]
    print(fleet_number, registration, service_number, last_seen, model, sep="\n")
    print('-' * 80)

Prints:

33101
SK19 EOM
500
7 Jan 23:30
ADL Enviro400 City
--------------------------------------------------------------------------------
33102
SK19 EOO
500
7 Jan 18:35
ADL Enviro400 City
--------------------------------------------------------------------------------
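
If the real table also has a header row or rows with fewer cells, the five-way unpacking above would raise a ValueError. Here is a minimal sketch of one way to guard against that and collect the rows as dictionaries. The parse_rows helper, the FIELDS names and the header row in the sample are assumptions made for illustration, not something taken from the actual page:

```python
from bs4 import BeautifulSoup

# Hypothetical field names for the first five cells of each row.
FIELDS = ("fleet_number", "registration", "service_number", "last_seen", "model")


def parse_rows(html):
    """Return one dict per <tr> that has at least len(FIELDS) <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("tr"):
        # get_text(strip=True) trims the whitespace around each cell's text.
        cells = [td.get_text(strip=True) for td in row.select("td")]
        if len(cells) < len(FIELDS):
            continue  # skip header rows (<th> only) and short rows
        # zip stops at the shorter sequence, so extra cells are ignored.
        records.append(dict(zip(FIELDS, cells)))
    return records


# Sample input; the <th> header row is assumed, not copied from the real page.
sample = """
<table>
<tr><th>Fleet no</th><th>Reg</th><th>Route</th><th>Last seen</th><th>Type</th></tr>
<tr>
  <td><a href="/vehicles/fgla-33101">33101</a></td>
  <td><a href="/vehicles/fgla-33101">SK19 EOM</a></td>
  <td> 500 </td>
  <td> 7 Jan 23:30 </td>
  <td>ADL Enviro400 City</td>
</tr>
</table>
"""

for record in parse_rows(sample):
    print(record)
```

With the live page you would pass response.text to parse_rows instead of the sample string.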