Purpose: use Selenium to get the entire page source.
Problem: the loaded page does not contain the content, only JavaScript and CSS files.
Target site: https://www.warcraftlogs.com
Test code (requires 'pip install selenium'):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline")
page_source = driver.page_source
# Write the HTML as UTF-8; the with-block closes the file automatically
with open("page_source.html", "w", encoding="utf-8") as file_to_write:
    file_to_write.write(page_source)
Things I tried:
- The Python requests library gives the same result: the response contains only JS and CSS, not the content.
My personal guess is that this site deliberately hides its content data.
I want to scrape data from this site.
How can I do that?
CodePudding user response:
Here is a way of getting the page source after all elements have loaded:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
[...]
wait = WebDriverWait(driver, 5)
url='https://www.warcraftlogs.com/zone/rankings/29#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline'
driver.get(url)
stuffs = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div[@]')))
t.sleep(5)
print(driver.page_source)
You can then write the page source to a file, etc. Selenium documentation: https://www.selenium.dev/documentation/
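If the fixed sleep feels fragile, one possible alternative is to poll driver.page_source until it stops changing and only then save it. The following is a rough sketch of that idea, not the approach from the answer above: the helper names get_rendered_source, save_page_source and main, and the defaults for settle_checks, interval and timeout, are made up for illustration. It relies only on driver.get and driver.page_source, and it assumes the page eventually stops mutating its DOM, which may not hold for pages that update continuously.

```python
import time
from pathlib import Path


def get_rendered_source(driver, url, settle_checks=3, interval=1.0, timeout=30.0):
    """Load url and return page_source once it is unchanged for
    settle_checks consecutive polls (hypothetical helper)."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    last = driver.page_source
    stable = 0
    while stable < settle_checks and time.monotonic() < deadline:
        time.sleep(interval)
        current = driver.page_source
        if current == last:
            stable += 1      # one more poll with no change
        else:
            stable = 0       # DOM changed, start counting again
            last = current
    # On timeout this returns the most recent snapshot, which may be incomplete
    return last


def save_page_source(html, path="page_source.html"):
    """Write the HTML to disk as UTF-8 and return the path (hypothetical helper)."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


def main():
    # Imported here so the helpers above can be used without Selenium installed
    from selenium import webdriver

    url = ("https://www.warcraftlogs.com/zone/rankings/29"
           "#boss=2512&metric=hps&difficulty=3&class=Priest&spec=Discipline")
    driver = webdriver.Chrome()
    try:
        save_page_source(get_rendered_source(driver, url))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling main() opens Chrome, waits for the source to settle, and writes page_source.html. A larger settle_checks or interval makes the check more conservative at the cost of a slower run.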