Download entire webpage as HTML (including the HTML assets) without Save As pop-up using Selenium

I am trying to scrape a website and download all of its pages as .html files (including all the HTML assets), so that each locally saved page opens and renders the same way it does on the server.

I am currently using Selenium, Chrome WebDriver, and Python.

Approach:

I tried updating the prefs of the Chrome browser and then logging in to the website. After logging in, I want to save the webpage the same way it is saved by pressing Ctrl+S on the keyboard.

The code below opens the page I want to download, but it neither suppresses the Windows Save As pop-up nor saves the page to the specified path.

from selenium import webdriver
import pyautogui

chrome_options = webdriver.ChromeOptions()
preferences = {
    "download.default_directory": "C:\\Users\\pathtodir",
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "safebrowsing.enabled": True,
}
chrome_options.add_experimental_option("prefs", preferences)
driver = webdriver.Chrome(options=chrome_options)
driver.get("***URL to the website***")
driver.find_element("xpath", '//*[@id="id_username"]').send_keys('username')
driver.find_element("xpath", '//*[@id="id_password"]').send_keys('password')
driver.find_element("xpath", '//*[@id="datagrid-0"]/div[2]/div[1]/div[1]/table/tbody/tr[1]/td[2]/a').click()
pyautogui.hotkey('ctrl', 's')
pyautogui.typewrite('hello1' + '.html')
pyautogui.hotkey('enter')

Can somebody please help me understand what I am doing wrong? Please also suggest any alternative Python library that could be used for this.

CodePudding user response：

To save a page, first obtain its source through the driver's page_source property. Note that this captures only the HTML markup of the current DOM; linked assets such as images, stylesheets, and scripts are not downloaded.

Then open a file with the codecs.open function, passing "w" for write mode and "utf-8" as the encoding, and use the file's write method to write the content obtained from page_source.

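from selenium import webdriver
import codecs
import os

# Recent Selenium 4 releases locate chromedriver automatically; the
# executable_path argument used by older versions has been removed.
driver = webdriver.Chrome()
driver.implicitly_wait(0.5)

driver.get("***URL to the website***")
h = driver.page_source  # serialized HTML of the current DOM

# Raw string so the backslash in the Windows path is not read as an escape.
n = os.path.join(r"C:\ANYPATH", "Page.html")
# Write the HTML as UTF-8; the with block closes the file afterwards.
with codecs.open(n, "w", "utf-8") as f:
    f.write(h)
driver.quit()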
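
Since page_source returns only the markup, one possible way to also capture the assets without any Save As dialog is to ask Chrome for an MHTML snapshot through the DevTools Protocol. The sketch below is not part of the answer above; it assumes a Chromium-based Selenium driver (which exposes execute_cdp_cmd), and the helper name save_page_as_mhtml is hypothetical. The result is a single .mhtml file rather than an .html file with a separate assets folder.

```python
def save_page_as_mhtml(driver, path):
    """Save the page currently loaded in `driver`, with its assets, as one MHTML file.

    Assumes `driver` is a Chromium-based Selenium WebDriver.
    """
    # Page.captureSnapshot is a Chrome DevTools Protocol command; it returns
    # the serialized page (markup plus embedded resources) under "data".
    snapshot = driver.execute_cdp_cmd("Page.captureSnapshot", {"format": "mhtml"})
    # newline="" keeps the line endings of the MIME document unchanged.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(snapshot["data"])
    return path


# Hypothetical usage, after logging in and navigating to the target page:
#   save_page_as_mhtml(driver, r"C:\ANYPATH\Page.mhtml")
```

Chrome can open the resulting .mhtml file directly, which may be enough when the goal is an offline copy that renders like the original.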

CodePudding user response：

I was able to fix the issue. The main problem was that my program was quitting before the browser had finished saving the file; adding time.sleep() fixed it.

Updated code:

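from selenium import webdriver
import pyautogui
import time

# chrome_options is built exactly as in the question
driver = webdriver.Chrome(options=chrome_options)
driver.get("***URL to the website***")
driver.find_element("xpath", '//*[@id="id_username"]').send_keys('username')
driver.find_element("xpath", '//*[@id="id_password"]').send_keys('password')
driver.find_element("xpath", '//*[@id="datagrid-0"]/div[2]/div[1]/div[1]/table/tbody/tr[1]/td[2]/a').click()
# Open the Save As dialog (step carried over from the question's code)
pyautogui.hotkey('ctrl', 's')
FILE_NAME = r'C:\ANYPATH\Page.html'
pyautogui.typewrite(FILE_NAME)
pyautogui.press('enter')
time.sleep(10)  # give the browser time to finish saving before quitting
driver.quit()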
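
A fixed time.sleep(10) can be too short for a large page and needlessly long for a small one. As a rough alternative, the sketch below polls for the saved file until it exists and its size has stopped changing. The helper name wait_for_file and its parameters are hypothetical, and it assumes the browser writes directly to the path that was typed into the dialog.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once `path` exists with a stable, non-zero size.

    Returns False if that has not happened within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        if os.path.isfile(path):
            size = os.path.getsize(path)
            # Two consecutive polls with the same non-zero size are taken
            # as a sign that the browser has finished writing.
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll_interval)
    return False
```

In the updated code this call would take the place of time.sleep(10), for example wait_for_file(FILE_NAME) just before driver.quit().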