I have been trying to scrape a table using a mixture of guides and ChatGPT.
The page I am fetching contains the following HTML:
<tr>
<td >
<a href="/vehicles/fgla-33101">33101</a>
</td>
<td><a href="/vehicles/fgla-33101">SK19 EOM</a></td>
<td >
500
</td>
<td >
7 Jan 23:30
</td>
<td>ADL Enviro400 City</td>
<td >
<div ></div>
Glasgow Airport Express
</td>
<td ></td>
<td >USB power</td>
<td><a href="https://www.flickr.com/search/?text=SK19EOM or "SK19 EOM" or First 33101&sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a></td>
<td><a href="/vehicles/fgla-33101/edit">Edit</a></td>
</tr>
<tr>
<td >
<a href="/vehicles/fgla-33102">33102</a>
</td>
<td><a href="/vehicles/fgla-33102">SK19 EOO</a></td>
<td >
500
</td>
<td >
7 Jan 18:35
</td>
<td>ADL Enviro400 City</td>
<td >
<div ></div>
Glasgow Airport Express
</td>
<td ></td>
<td >USB power</td>
<td><a href="https://www.flickr.com/search/?text=SK19EOO or "SK19 EOO" or First 33102&sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a></td>
<td><a href="/vehicles/fgla-33102/edit">Edit</a></td>
</tr>
Following these guides, this is what I've tried:
# Parse the HTML of the web page
soup = BeautifulSoup(response.text, 'html.parser')
# Find all the buses on the page
buses = soup.find_all('table', class_='fleet compact')
print(buses)
# Loop through the buses and extract the information
for bus in buses:
    fleet_number = (bus.find('td', class_='number').text)
    registration = (bus.find('td', class_='number'))
    service_number = (bus.find('td', class_='last-seen').text)
    last_seen = (bus.find('td', class_='last-seen'))
    model = (bus.find('td').text)
    # Print the scraped information
    print(fleet_number)
    print(registration)
    print(service_number)
    print(last_seen)
    print(model)
But this only gives me:
33101
<td >
<a href="/vehicles/fgla-33101">33101</a>
</td>
500
<td >
500
</td>
33101
But my expected output was:
33101
SK19 EOM
500
7 Jan 23:30
ADL Enviro400 City
I'm unsure how to do it any other way currently. Would there be a way to make this work?
CodePudding user response:
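Your code calls find on the whole table, so every lookup returns the first matching cell in the first row. Loop over the rows instead and read the cells by position. You can try: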
from bs4 import BeautifulSoup
html_doc = """\
<tr>
<td >
<a href="/vehicles/fgla-33101">33101</a>
</td>
<td>
<a href="/vehicles/fgla-33101">SK19 EOM</a>
</td>
<td > 500 </td>
<td > 7 Jan 23:30 </td>
<td>ADL Enviro400 City</td>
<td >
<div ></div> Glasgow Airport Express
</td>
<td ></td>
<td >USB power</td>
<td>
<a href="https://www.flickr.com/search/?text=SK19EOM or "SK19 EOM" or First 33101&sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a>
</td>
<td>
<a href="/vehicles/fgla-33101/edit">Edit</a>
</td>
</tr>
<tr>
<td >
<a href="/vehicles/fgla-33102">33102</a>
</td>
<td>
<a href="/vehicles/fgla-33102">SK19 EOO</a>
</td>
<td > 500 </td>
<td > 7 Jan 18:35 </td>
<td>ADL Enviro400 City</td>
<td >
<div ></div> Glasgow Airport Express
</td>
<td ></td>
<td >USB power</td>
<td>
<a href="https://www.flickr.com/search/?text=SK19EOO or "SK19 EOO" or First 33102&sort=date-taken-desc" target="_blank" rel="noopener">Flickr</a>
</td>
<td>
<a href="/vehicles/fgla-33102/edit">Edit</a>
</td>
</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
for row in soup.select("tr"):
    tds = [td.text.strip() for td in row.select("td")]
    fleet_number, registration, service_number, last_seen, model = tds[:5]
    print(fleet_number, registration, service_number, last_seen, model, sep="\n")
    print('-' * 80)
Prints:
33101
SK19 EOM
500
7 Jan 23:30
ADL Enviro400 City
--------------------------------------------------------------------------------
33102
SK19 EOO
500
7 Jan 18:35
ADL Enviro400 City
--------------------------------------------------------------------------------
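If the real page also has a header row (only th cells) or rows with fewer than five cells, the five-way unpacking above would raise a ValueError. Here is a minimal sketch of one way to guard against that and collect each row as a dict. The names parse_rows, COLUMNS and sample are my own, not from the question, and the sample markup is a cut-down assumption of what the page looks like:

```python
from bs4 import BeautifulSoup

# Hypothetical column names, borrowed from the variable names used in the question.
COLUMNS = ["fleet_number", "registration", "service_number", "last_seen", "model"]


def parse_rows(html):
    """Return one dict per <tr> that has at least len(COLUMNS) <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    vehicles = []
    for row in soup.select("tr"):
        # get_text(strip=True) drops the newlines around values such as "500".
        cells = [td.get_text(strip=True) for td in row.select("td")]
        if len(cells) < len(COLUMNS):
            # Header rows (<th> only) and short rows are skipped.
            continue
        # zip stops at the shorter sequence, so extra cells are ignored.
        vehicles.append(dict(zip(COLUMNS, cells)))
    return vehicles


# Assumed, simplified sample: one header row and one data row.
sample = """
<table>
<tr><th>Fleet no</th><th>Reg</th></tr>
<tr>
<td><a href="/vehicles/fgla-33101">33101</a></td>
<td><a href="/vehicles/fgla-33101">SK19 EOM</a></td>
<td>
500
</td>
<td>
7 Jan 23:30
</td>
<td>ADL Enviro400 City</td>
<td>Glasgow Airport Express</td>
</tr>
</table>
"""

for vehicle in parse_rows(sample):
    print(vehicle)
```

On the live page you would pass response.text to parse_rows instead of sample. Reading cells by position assumes the column order on the page does not change.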