I want to scrape the bus schedule times from https://www.redbus.in/. After entering the locations I am interested in into the search fields, I arrive at the following link, which is an example of the pages I want to scrape: https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any
When I manually save this page and open the HTML file I can find the search results including Bus operator names, departure times, fare etc. But when I do the same using Python that part of the page is not saved. The code I am using is the following:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any"
browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source)
browser.quit()
The soup object that is created this way has all the other content of the page in HTML format except the search results showing the bus route and time information. I am not sure why that is the case.
I am new to web scraping, so any help here will be really appreciated.
CodePudding user response:
The main issue is that the data you expect needs a moment to be loaded and rendered by the browser, so the simplest fix is to wait a second or two before reading the page source.
...
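import time
time.sleep(2)  # give the page's JavaScript time to load and render the search results
soup = BeautifulSoup(browser.page_source, "html.parser")  # name the parser explicitly
browser.quit()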
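A fixed sleep can be too short on a slow connection and wastes time on a fast one. As a rough sketch of an alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until is hypothetical and not part of selenium; selenium ships its own WebDriverWait for this purpose, and the helper only illustrates the idea in plain Python.

```python
import time


def wait_until(predicate, timeout=10.0, poll_interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns the truthy value, or raises TimeoutError once `timeout`
    seconds have passed without the condition holding.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical usage with the browser object from the question, assuming
# the rendered search results contain the "dp-time" class:
#   wait_until(lambda: "dp-time" in browser.page_source, timeout=15)
#   soup = BeautifulSoup(browser.page_source, "html.parser")
```

The page source is read only after the condition holds, so the wait lasts roughly as long as the page actually needs, up to the timeout.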
But selenium is not necessary; you can also access the JSON with all the data via requests:
import requests
import json
url = "https://www.redbus.in/search/SearchResults?fromCity=979&toCity=313&src=Bhopal&dst=Indore&DOJ=18-Sep-2022&sectionId=0&groupId=0&limit=0&offset=0&sort=0&sortOrder=0&meta=true&returnSearch=0"
headers = {
'authority': 'www.redbus.in',
'accept': 'application/json, text/plain, */*',
'content-length': '0',
'content-type': 'application/json',
'cookie': 'country=IND; currency=INR; selectedCurrency=INR; language=en;',
'origin': 'https://www.redbus.in',
'user-agent': 'Mozilla/5.0'
}
response = requests.request("POST", url, headers=headers)
response.json()['inv']
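To query other routes or dates, the search URL can be assembled from its parameters instead of being hard-coded. The sketch below is one way to do that with the standard library. The function name build_search_url is hypothetical; the parameter names and fixed values are copied from the URL above (sectionId is the assumed spelling of the section parameter), and whether the endpoint still accepts them is an assumption.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.redbus.in/search/SearchResults"


def build_search_url(from_city_id, to_city_id, src, dst, date_of_journey):
    """Build the search URL for one route and date (e.g. '18-Sep-2022')."""
    params = {
        "fromCity": from_city_id,
        "toCity": to_city_id,
        "src": src,
        "dst": dst,
        "DOJ": date_of_journey,
        # The remaining values are copied unchanged from the example URL.
        "sectionId": 0,  # assumed spelling of the section parameter
        "groupId": 0,
        "limit": 0,
        "offset": 0,
        "sort": 0,
        "sortOrder": 0,
        "meta": "true",
        "returnSearch": 0,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url(979, 313, "Bhopal", "Indore", "18-Sep-2022")
print(url)
```

The resulting url can be passed to requests.request("POST", url, headers=headers) exactly as in the snippet above. The city IDs (979 and 313 here) are the fromCityId and toCityId values that appear in the page link from the question.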
CodePudding user response:
You don't need to use selenium or soup. Sometimes you can use only the 'requests' module and check whether there is an API that sends you a response. (You can do this via the network tab of your browser.)
For example, the site you want to scrape appears to have such an API. Look at how the API works, send a request in exactly the same way, get the JSON response, and then parse it.
CodePudding user response:
Solution: According to your selenium tag
Your code is almost there. You just have to do three things in your code:
Allow some load time
Maximize the window size
Pass a parser to BeautifulSoup
Example:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver")  # your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url = "https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any"
driver.get(url)
driver.maximize_window()
time.sleep(5)  # wait for the search results to render
content = driver.page_source
driver.quit()  # the browser is no longer needed once the HTML is captured
soup = BeautifulSoup(content, "html.parser")
# Each departure time sits in a div with this exact class string
times = soup.find_all("div", class_="dp-time f-19 d-color f-bold")
for t in times:
    schedule_time = t.get_text(strip=True)
    print(schedule_time)
Output:
07:00
19:00
17:00
09:00
18:00