The Situation
I am trying to scrape web pages to get some data. For my application, I need the HTML content as a whole, exactly as it is viewable in the browser.
The Problem
When I scrape some URLs, I get data that is not viewable in the browser, even though it is present in the HTML source. Is there any way to scrape only the data that is actually visible in the browser?
Code
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/nebu/selenium_drivers/chromedriver")
URL = "https://augustasymphony.com/event/top-of-the-world/"

driver = webdriver.Chrome(service=service, options=options)
try:
    driver.get(URL)
    # Note: implicitly_wait only affects element lookups, not page_source.
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException:
    driver.quit()
    raise
driver.quit()

soup = BeautifulSoup(html_content, "html.parser")

# Remove the header and footer so only the main content remains.
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is not None:
        s.extract()

text = soup.get_text(separator=' ')
print(text)
The Question
Where am I going wrong here? How can I go about debugging this?
CodePudding user response:
This is simply a case of needing to extract the data in a more specific manner.
You really have two options:
Option 1 (in my opinion the better choice, as it is faster and less resource-heavy):
import requests
from bs4 import BeautifulSoup as bs

# A browser-like User-Agent so the site serves the full page.
headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
soup = bs(res.text, "lxml")

# Pick out just the fields we want.
event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
time = soup.find("p", {"class": "rhino-event-time"}).text.strip()
You can use requests quite simply, as shown in the code above, specifically selecting the data you want and perhaps saving it in a dictionary. This is the normal way to go about it. The page may contain a lot of scripts, but it doesn't require JavaScript to load this data dynamically.
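One thing to watch: find() returns None when the element is missing, so calling .text on the result raises an AttributeError. A minimal defensive sketch (the safe_text helper is my own, hypothetical name):

def safe_text(soup, tag, cls):
    # find() returns None if nothing matches; guard before touching .text.
    node = soup.find(tag, {"class": cls})
    return node.text.strip() if node else None

# Assuming `soup` was built as above; the dictionary keys are my own choice.
event = {
    "event_header": safe_text(soup, "h2", "rhino-event-header"),
    "time": safe_text(soup, "p", "rhino-event-time"),
}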
Option 2:
You can continue using Selenium and collect the entire body of the page using one of several selections:
from selenium.webdriver.common.by import By

driver.find_element(By.ID, 'wrapper').get_attribute('innerHTML')                     # entire body
driver.find_element(By.ID, 'tribe-events').get_attribute('innerHTML')                # the events list
driver.find_element(By.ID, 'rhino-event-single-content').get_attribute('innerHTML')  # the single event
This second option is much more a matter of taking the whole HTML and dumping it.
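If you do dump it, you can feed the raw HTML straight back into BeautifulSoup to clean it up; a sketch, assuming the driver from the question is still open:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# Dump the single-event container and strip it down to its visible text.
raw = driver.find_element(By.ID, 'rhino-event-single-content').get_attribute('innerHTML')
soup = BeautifulSoup(raw, "lxml")
print(soup.get_text(separator=' ', strip=True))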
Personally, I would go with the first option, creating dictionaries of the cleaned data.
Edit:
To further illustrate my example:
import requests
from bs4 import BeautifulSoup as bs

headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

# Build a seed list of event URLs from the listing page, skipping the iCal export links.
res = requests.get("https://augustasymphony.com/event/", headers=headers)
soup = bs(res.text, "lxml")
seedlist = {a["href"] for a in soup.find("div", {"id": "tribe-events-content-wrapper"}).find_all("a") if '?ical=1' not in a["href"]}

# Visit each event page and pull out the fields we want.
for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    data = dict()
    data['event_header'] = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
    data['time'] = soup.find("p", {"class": "rhino-event-time"}).text.strip()
    print(data)
Here I am generating a seed list of event URLs and then visiting each one to extract its information.
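If you want to keep the results rather than just print them, a small extension could collect each dictionary into a list and write them out; a sketch, where "events.json" is an example file name of my own choosing:

import json

events = []
for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    events.append({
        'event_header': soup.find("h2", {"class": "rhino-event-header"}).text.strip(),
        'time': soup.find("p", {"class": "rhino-event-time"}).text.strip(),
    })

# Persist the collected events to disk as JSON.
with open("events.json", "w") as f:
    json.dump(events, f, indent=2)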
CodePudding user response:
Some websites detect whether a request comes from a real web browser.
If they decide it doesn't, they don't send the full HTML file back.
That's why the HTML you get back is missing content.
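A common workaround is to send a browser-like User-Agent header, as the first answer does; a minimal sketch:

import requests

# Without a browser-like User-Agent, some sites return a stripped-down
# or empty response to the default python-requests identifier.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36"}
res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
print(res.status_code, len(res.text))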