How can I scrape an Apple HTML page using Python?

Time:09-04

I am trying to scrape the h2 tag shown below from an Apple TV page, using the Python 3.10.6 code further below. I can see the h2 tag in the rendered page, but my script, run from PyCharm 2022.1.4, is unable to find it. episode-shelf-header is a unique class in the HTML of this page.

I did search for a solution to this but was unable to find one.

Can anyone help?

<div class="episode-shelf-header" id="{{@model.id}}-{{@shelf.id}}">
    <h2 class="typ-headline-emph">
        Season 1
    </h2>
</div>

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://tv.apple.com/us/show/life-by-ella/umc.cmc.1suiyueh1ntwjtsstcwldofno?ctx_brand=tvs.sbd.4000')

pageSource = driver.page_source
soup = BeautifulSoup(pageSource, 'html.parser')
div = soup.find('div', attrs={'class': 'episode-shelf-header'})
h2 = div.find('h2', attrs={'class': 'typ-headline-emph'})

CodePudding user response:

You can use Selenium's built-in methods:

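from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://tv.apple.com/us/show/life-by-ella/umc.cmc.1suiyueh1ntwjtsstcwldofno?ctx_brand=tvs.sbd.4000')
# find_element_by_class_name was removed in Selenium 4.3.0; use find_element with By instead
driver.find_element(By.CLASS_NAME, "episode-shelf-header").text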

Output:

Out[16]: 'Season 1'

CodePudding user response:

  1. The value can be extracted directly with Selenium.
  2. You must wait for the page to fully load before reading it.

Here is sample code that extracts the value:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://tv.apple.com/us/show/life-by-ella/umc.cmc.1suiyueh1ntwjtsstcwldofno?ctx_brand=tvs.sbd.4000')
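# XPath to the h2 inside the div, using the id as it appears in the page's HTML
x_path = '//*[@id="{{@model.id}}-{{@shelf.id}}"]/h2'
# Retry the lookup for up to 10 seconds until the element is found
element = WebDriverWait(driver, 10).until(lambda x: x.find_element(By.XPATH, x_path))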

print(element.text)

Note: Selenium version used: 4.3.0.
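
To illustrate why waiting matters, here is a simplified sketch of the polling idea behind an explicit wait: retry a lookup until it succeeds or a timeout expires. It is not Selenium's implementation. wait_until, ElementNotReady and FakePage are hypothetical names made up for this example, and FakePage stands in for a browser page whose heading only appears after a few lookups.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ElementNotReady(Exception):
    """Hypothetical stand-in for a 'no such element' error."""


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 10.0, poll: float = 0.5) -> T:
    """Call condition repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except ElementNotReady:
            pass  # element not rendered yet; try again after a short pause
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


class FakePage:
    """Hypothetical page whose heading only appears after a few lookups."""

    def __init__(self, ready_after: int) -> None:
        self.lookups = 0
        self.ready_after = ready_after

    def find_heading(self) -> str:
        self.lookups += 1
        if self.lookups < self.ready_after:
            raise ElementNotReady("h2 not in the DOM yet")
        return "Season 1"


page = FakePage(ready_after=3)
heading = wait_until(page.find_heading, timeout=2.0, poll=0.01)
print(heading)  # Season 1
```

The first two lookups fail, just as reading page_source immediately after driver.get can miss content that the page renders later with JavaScript. The third lookup succeeds and the wait returns right away instead of sleeping for the full timeout.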
