The Situation
I am trying to scrape web pages to get some data. For my application, I need the HTML content as a whole, exactly as it is viewable in the browser.
The Problem
When I scrape some URLs, I get data that is not viewable in the browser, even though it is present in the HTML source. Is there any way to scrape only the data that is actually visible in the browser?
Code
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.add_argument("--headless")
service = Service("/home/nebu/selenium_drivers/chromedriver")
URL = "https://augustasymphony.com/event/top-of-the-world/"

driver = webdriver.Chrome(service=service, options=options)
try:
    driver.get(URL)
    # Note: implicitly_wait only affects element lookups, not page_source.
    driver.implicitly_wait(2)
    html_content = driver.page_source
except WebDriverException:
    driver.quit()
    raise
driver.quit()

soup = BeautifulSoup(html_content, "html.parser")

# Remove the header and footer so only the main content remains.
for each in ['header', 'footer']:
    s = soup.find(each)
    if s is not None:
        s.extract()

text = soup.get_text(separator=' ')
print(text)
The Question
Where am I going wrong here? How can I go about debugging this?
CodePudding user response:
This is simply a case of needing to extract the data in a more specific manner.
You really have two options:
Option 1 (in my opinion the better choice, as it is faster and less resource-heavy):
import requests
from bs4 import BeautifulSoup as bs

# A browser-like User-Agent so the site serves the full page.
headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
soup = bs(res.text, "lxml")

# Pick out just the fields we want.
event_header = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
time = soup.find("p", {"class": "rhino-event-time"}).text.strip()
You can use requests quite simply, as shown in the code above, specifically selecting the data you want and perhaps saving it in a dictionary. This is the normal way to go about it. The page may contain a lot of scripts, but it doesn't require JavaScript to load this data dynamically.
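One thing to watch: find() returns None when the element is missing, so calling .text on the result raises an AttributeError. A minimal defensive sketch (the safe_text helper is my own, hypothetical name):

def safe_text(soup, tag, cls):
    # find() returns None if nothing matches; guard before touching .text.
    node = soup.find(tag, {"class": cls})
    return node.text.strip() if node else None

# Assuming `soup` was built as above; the dictionary keys are my own choice.
event = {
    "event_header": safe_text(soup, "h2", "rhino-event-header"),
    "time": safe_text(soup, "p", "rhino-event-time"),
}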
Option 2:
You can continue using Selenium and collect the entire body of the page using one of several selections:
from selenium.webdriver.common.by import By

driver.find_element(By.ID, 'wrapper').get_attribute('innerHTML')                     # entire body
driver.find_element(By.ID, 'tribe-events').get_attribute('innerHTML')                # the events list
driver.find_element(By.ID, 'rhino-event-single-content').get_attribute('innerHTML')  # the single event
This second option is much more a matter of taking the whole HTML and dumping it.
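If you do dump it, you can feed the raw HTML straight back into BeautifulSoup to clean it up; a sketch, assuming the driver from the question is still open:

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

# Dump the single-event container and strip it down to its visible text.
raw = driver.find_element(By.ID, 'rhino-event-single-content').get_attribute('innerHTML')
soup = BeautifulSoup(raw, "lxml")
print(soup.get_text(separator=' ', strip=True))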
Personally, I would go with the first option, creating dictionaries of the cleaned data.
Edit:
To further illustrate my example:
import requests
from bs4 import BeautifulSoup as bs

headers = {'Accept': '*/*',
           'Connection': 'keep-alive',
           'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36 OPR/57.0.3098.91'}

# Build a seed list of event URLs from the listing page, skipping the iCal export links.
res = requests.get("https://augustasymphony.com/event/", headers=headers)
soup = bs(res.text, "lxml")
seedlist = {a["href"] for a in soup.find("div", {"id": "tribe-events-content-wrapper"}).find_all("a") if '?ical=1' not in a["href"]}

# Visit each event page and pull out the fields we want.
for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    data = dict()
    data['event_header'] = soup.find("h2", {"class": "rhino-event-header"}).text.strip()
    data['time'] = soup.find("p", {"class": "rhino-event-time"}).text.strip()
    print(data)
Here I am generating a seed list of event URLs and then visiting each one to extract its information.
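If you want to keep the results rather than just print them, a small extension could collect each dictionary into a list and write them out; a sketch, where "events.json" is an example file name of my own choosing:

import json

events = []
for seed in seedlist:
    res = requests.get(seed, headers=headers)
    soup = bs(res.text, "lxml")
    events.append({
        'event_header': soup.find("h2", {"class": "rhino-event-header"}).text.strip(),
        'time': soup.find("p", {"class": "rhino-event-time"}).text.strip(),
    })

# Persist the collected events to disk as JSON.
with open("events.json", "w") as f:
    json.dump(events, f, indent=2)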
CodePudding user response:
Some websites detect whether a request comes from a real web browser.
If they decide it doesn't, they don't send the full HTML file back.
That's why the HTML you get back is missing content.
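A common workaround is to send a browser-like User-Agent header, as the first answer does; a minimal sketch:

import requests

# Without a browser-like User-Agent, some sites return a stripped-down
# or empty response to the default python-requests identifier.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/73.0.3683 Safari/537.36"}
res = requests.get("https://augustasymphony.com/event/top-of-the-world/", headers=headers)
print(res.status_code, len(res.text))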