Scraping only the HTML data that is viewable in the browser

I am trying to scrape webpages to get some data. For my application I need all of the HTML data that is viewable in the browser. But when I scrape some URLs, I get data that is not viewable in the browser, even though it is present in the HTML code. Is there any way to scrape only the data that is viewable in the browser?

Code

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/nebu/selenium_drivers/chromedriver")

URL = "https://augustasymphony.com/event/top-of-the-world/"
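html_content = ""
driver = None
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(URL)
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException as exc:
    print(f"Could not load {URL}: {exc}")
finally:
    # Quit only if the driver was actually created, so a failed start does not raise NameError.
    if driver is not None:
        driver.quit()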

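# Name the parser explicitly; otherwise bs4 guesses one and emits a warning.
soup = BeautifulSoup(html_content, "html.parser")
# Drop the site-wide header and footer so only the page body remains.
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is not None:
        s.extract()
text = soup.get_text(separator=' ')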
print(text)

CodePudding user response：

This is simply a case of needing to extract the data more selectively.

You really have two options:

Option 1 (in my opinion the better one, as it is faster and less resource-heavy):

import requests
from bs4 import BeautifulSoup as bs


headers = {'Accept': '*/*',
 'Connection': 'keep-alive',
 'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}
res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
soup = bs(res.text, "lxml")

event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
time = soup.find("p", {"class": "rhino-event-time"}).text.strip()

With requests you can fetch the page and, as the code above shows, select exactly the data you want and perhaps save it in a dictionary. This is the usual way to go about it. The page may contain a lot of scripts, but it does not need JavaScript to load this data dynamically, so a plain HTTP request is enough.
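As a minimal sketch of the dictionary idea, assuming the class names used above (rhino-event-header, rhino-event-time) and a made-up HTML fragment in place of the live page: the hypothetical helpers text_or_none and extract_event return None for a missing field instead of raising AttributeError when find() comes back empty. The built-in "html.parser" is used here so that lxml is not required.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, class_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    node = soup.find(tag, class_=class_name)
    return node.get_text(strip=True) if node is not None else None


def extract_event(html):
    """Collect the fields of interest from one event page into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "event_header": text_or_none(soup, "h2", "rhino-event-header"),
        "time": text_or_none(soup, "p", "rhino-event-time"),
    }


# Made-up fragment standing in for res.text; the real page markup may differ.
sample_html = """
<div>
  <h2 class="rhino-event-header"> Top of the World </h2>
  <script>var tracking = "not wanted";</script>
</div>
"""

event = extract_event(sample_html)
print(event)
```

Because only the named elements are read, the script contents never reach the result, and a page that lacks one of the fields still yields a usable dictionary.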

Option 2:

You continue using Selenium and collect the page's body content with one of several selectors.

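from selenium.webdriver.common.by import By  # the find_element_by_* helpers were removed in later Selenium 4 releases

# Run these before driver.quit(), while the browser session is still open.
driver.find_element(By.ID, 'wrapper').get_attribute('innerHTML') # Entire body
driver.find_element(By.ID, 'tribe-events').get_attribute('innerHTML') # the events list
driver.find_element(By.ID, 'rhino-event-single-content').get_attribute('innerHTML') # the single event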

This second option amounts to taking the whole HTML and dumping it.
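If the goal is only the text a visitor actually sees, the dumped HTML can be filtered before the text is extracted. The following is a rough sketch, not a full rendering check: the hypothetical helper approximate_visible_text drops tags that are never rendered as text, plus elements hidden through the hidden attribute or an inline style. Elements hidden by an external stylesheet or by JavaScript are not detected; for those, reading the .text property of a Selenium element, which returns the rendered text, is likely the more reliable route. The sample HTML is made up.

```python
from bs4 import BeautifulSoup

# Tags whose contents a browser does not render as page text.
NON_RENDERED_TAGS = {"script", "style", "noscript", "template", "head"}


def is_not_rendered(tag):
    """True for non-rendered tags and for elements hidden inline."""
    if tag.name in NON_RENDERED_TAGS or tag.has_attr("hidden"):
        return True
    style = tag.get("style", "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


def approximate_visible_text(html):
    """Remove obviously non-visible nodes, then return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    # extract() is safe to call on nested matches, even after a parent was removed.
    for tag in soup.find_all(is_not_rendered):
        tag.extract()
    return soup.get_text(separator=" ", strip=True)


# Made-up page standing in for driver.page_source.
sample_html = """
<html><head><title>Event</title><style>p { color: red; }</style></head>
<body>
  <h2>Top of the World</h2>
  <p style="display: none">Hidden promo text</p>
  <div hidden>Also hidden</div>
  <p>Doors open at 7:00 pm</p>
  <script>console.log("tracking");</script>
</body></html>
"""

visible = approximate_visible_text(sample_html)
print(visible)
```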

Personally, I would go with the first option and create dictionaries of the cleaned data.

Edit:

To further illustrate my example:


import requests
from bs4 import BeautifulSoup as bs
headers = {'Accept': '*/*',
 'Connection': 'keep-alive',
 'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}
res = requests.get("https://augustasymphony.com/event/", headers=headers)
soup = bs(res.text, "lxml")
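# href=True skips anchors without an href attribute, which would otherwise raise KeyError.
seedlist = {a["href"] for a in soup.find("div", {"id": "tribe-events-content-wrapper"}).find_all("a", href=True) if '?ical=1' not in a["href"]}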
for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    data = dict()
    data['event_header'] = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
    data['time'] = soup.find("p", {"class": "rhino-event-time"}).text.strip()
    print(data)

Here I am generating a seedlist of event URLs and then visiting each one to extract the information.
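The seedlist step can also be isolated and checked without touching the network. This is a small sketch using only the standard library; build_seedlist is a hypothetical helper, and the hrefs are made-up stand-ins for the a["href"] values collected from the listing page. Unlike the substring test above, it resolves relative links and treats any URL with an ical query parameter as a calendar export.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def build_seedlist(hrefs, base_url):
    """Resolve hrefs against base_url, drop iCal export links, and de-duplicate."""
    seeds = set()
    for href in hrefs:
        absolute = urljoin(base_url, href)
        query = parse_qs(urlparse(absolute).query)
        if "ical" in query:
            continue  # calendar export link, not an event page
        seeds.add(absolute)
    return sorted(seeds)


# Made-up hrefs: a relative link, its absolute duplicate, and an iCal export.
hrefs = [
    "/event/top-of-the-world/",
    "https://augustasymphony.com/event/top-of-the-world/",
    "https://augustasymphony.com/event/top-of-the-world/?ical=1",
]

seedlist = build_seedlist(hrefs, "https://augustasymphony.com/event/")
print(seedlist)
```

Sorting the result makes the crawl order repeatable, which helps when comparing runs.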

CodePudding user response：

It's because some websites detect whether the request comes from a real web browser.

If it does not, they don't send the HTML back.

That's why no HTML is sent back.