Why are search results not showing up when web scraping bus schedules?

I want to scrape the bus schedule times from the following website: https://www.redbus.in/. Entering the locations I am interested in into the search fields takes me to the following link, which is an example of the pages I want to scrape: https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any

When I manually save this page and open the HTML file, I can find the search results, including bus operator names, departure times, fares, etc. But when I do the same using Python, that part of the page is not saved. The code I am using is the following:

from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any"

browser = webdriver.Chrome()
browser.get(url)
soup = BeautifulSoup(browser.page_source)
browser.quit()

The soup object created this way contains all the other content of the page as HTML, except the search results showing the bus route and time information. I am not sure why that is the case. I am new to web scraping, so any help here would be really appreciated.

CodePudding user response：

The main issue is that the data you expect needs a moment to be loaded and rendered by the browser, so the simplest fix is to wait a second or two before reading the page source.

...
import time

time.sleep(2)  # give the page's JavaScript time to render the search results
soup = BeautifulSoup(browser.page_source, "html.parser")
browser.quit()
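
A fixed sleep can be too short on a slow connection and wastes time on a fast one. A minimal sketch of the alternative, polling until the results actually show up in the page source, might look like this. The helper name wait_for_page_source is made up for illustration, and the "dp-time" marker assumes the departure-time elements carry that class name. Selenium also ships WebDriverWait, which offers the same idea built in.

```python
import time


def wait_for_page_source(driver, marker, timeout=15.0, poll_interval=0.5):
    """Poll driver.page_source until `marker` appears, then return the HTML.

    `driver` can be any object with a `page_source` attribute, such as a
    Selenium WebDriver. Raises TimeoutError if the marker never appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        time.sleep(poll_interval)
```

With the code from the question, this would be called as html = wait_for_page_source(browser, "dp-time") before browser.quit(), and html would then be passed to BeautifulSoup.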

But Selenium is not necessary; you can also fetch the JSON containing all the data directly via requests:

import requests

url = "https://www.redbus.in/search/SearchResults?fromCity=979&toCity=313&src=Bhopal&dst=Indore&DOJ=18-Sep-2022&sectionId=0&groupId=0&limit=0&offset=0&sort=0&sortOrder=0&meta=true&returnSearch=0"

headers = {
  'authority': 'www.redbus.in',
  'accept': 'application/json, text/plain, */*',
  'content-length': '0',
  'content-type': 'application/json',
  'cookie': 'country=IND; currency=INR; selectedCurrency=INR; language=en;',
  'origin': 'https://www.redbus.in',
  'user-agent': 'Mozilla/5.0'
}

response = requests.post(url, headers=headers)
response.raise_for_status()  # fail early on an HTTP error

print(response.json()['inv'])
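
If you need other routes or dates, the query string can be assembled from its parts instead of being hard-coded. This is a rough sketch, assuming the endpoint keeps accepting the same parameters as the URL above. The function name build_search_url is made up here, and since the meaning of parameters such as sectionId or groupId is not known, they are passed through with the same values.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.redbus.in/search/SearchResults"


def build_search_url(from_city_id, to_city_id, src, dst, date_of_journey):
    """Build the search URL with the same parameters and order as the URL above."""
    params = {
        "fromCity": from_city_id,
        "toCity": to_city_id,
        "src": src,
        "dst": dst,
        "DOJ": date_of_journey,  # same format as in the URL above, e.g. 18-Sep-2022
        # The remaining values are copied unchanged from the URL above;
        # what they control is not known.
        "sectionId": 0,
        "groupId": 0,
        "limit": 0,
        "offset": 0,
        "sort": 0,
        "sortOrder": 0,
        "meta": "true",
        "returnSearch": 0,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_search_url(979, 313, "Bhopal", "Indore", "18-Sep-2022")
print(url)
```

The city IDs (979 and 313 here) are the fromCityId and toCityId values that appear in the address bar after a manual search.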

CodePudding user response：

You don't need to use Selenium or BeautifulSoup. Sometimes the 'requests' module alone is enough: check whether the site calls an API that returns the data you want. (You can see this in the Network tab of your browser's developer tools.)

The site you want to scrape, for example, appears to have such an API. So look at how the browser calls it, send a request in exactly the same way, and parse the JSON response you get back.
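
Since the shape of such a response is usually undocumented, it helps to look at its structure before writing any parsing code. Here is a small sketch of one way to do that. The helper describe_json is made up for illustration, and the sample data is invented: only the top-level 'inv' key is known to exist in the real response, and every other field name is a placeholder.

```python
def describe_json(value, max_depth=3, depth=0):
    """Return a compact outline (keys and type names) of decoded JSON data."""
    if depth >= max_depth and isinstance(value, (dict, list)):
        return type(value).__name__
    if isinstance(value, dict):
        return {key: describe_json(item, max_depth, depth + 1)
                for key, item in value.items()}
    if isinstance(value, list):
        # Describe only the first element, assuming all items share one shape.
        return [describe_json(value[0], max_depth, depth + 1)] if value else []
    return type(value).__name__


# Invented sample data; the field names below are placeholders.
sample = {
    "inv": [
        {"placeholder_name": "Operator A", "placeholder_fare": 450.0},
        {"placeholder_name": "Operator B", "placeholder_fare": 500.0},
    ],
    "placeholder_meta": {"count": 2},
}

print(describe_json(sample))
```

Running the same helper on response.json() shows which keys hold the operator names, times and fares, so you can pick them out by name afterwards.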

CodePudding user response：

Solution, going by your selenium tag:

Your code is almost there. You just need to add three things:

Allow some time for the page to load
Maximize the browser window
Pass a parser to BeautifulSoup

Example:

from selenium import webdriver
import time
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service


webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)

url= "https://www.redbus.in/bus-tickets/bhopal-to-indore?fromCityName=Bhopal&fromCityId=979&toCityName=Indore&toCityId=313&onward=18-Sep-2022&srcCountry=IND&destCountry=IND&opId=0&busType=Any"
driver.get(url)
driver.maximize_window()
time.sleep(5)  # wait for the JavaScript-rendered results to load

content = driver.page_source
driver.quit()  # the page source is captured, so the browser can be closed
soup = BeautifulSoup(content, "html.parser")

times = soup.find_all("div", class_="dp-time f-19 d-color f-bold")
for t in times:
    schedule_time = t.get_text(strip=True)
    print(schedule_time)

Output: