What is the difference between driver.page_source and driver.execute_script('return document.do

Time:04-13

I'm analyzing a script written by someone else. This is a snippet of it:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    def chromedriver():
        options = webdriver.ChromeOptions()
        driver = webdriver.Chrome(options=options)
        driver.implicitly_wait(5)
        driver.maximize_window()
        return driver

    def pre_scraping(driver, alpha, beta):
        # Serialize the live DOM via JavaScript instead of reading driver.page_source
        soup = BeautifulSoup(driver.execute_script('return document.documentElement.outerHTML'), 'html.parser')
        deck = soup.find('div', {'id': 'mainSummary'})
        cards = deck.find_all('div', {'class': 'cp-tile'})[alpha:beta]

Instead of getting the source of the current page with driver.page_source, the script obtains the outer HTML (the root tag included) with driver.execute_script('return document.documentElement.outerHTML'). I found entries on Stack Overflow suggesting that for websites whose content changes quickly, it is better to use driver.execute_script than driver.page_source. Can driver.execute_script serve as a quick way to refresh specific page content? What is the difference between driver.page_source and driver.execute_script in this context? Here are two links related to the question:

How to get innerHTML of whole page in selenium driver?

How to check if a web page's content has been changed using Selenium's webdriver with Python?

CodePudding user response:

Get Page Source

As per the specification:

The Get Page Source command returns a string serialization of the DOM of the current browsing context active document.

So under the specification, driver.page_source returns a serialization of the DOM, not the raw response from the web server. The ambiguity comes from Selenium's own documentation, which describes it as the source of the last loaded page: if the page has been modified after loading (for example, by JavaScript or AJAX), there is no guarantee that the returned text is that of the modified page, and whether it reflects the current state of the page or the text last sent by the web server depends on the particular driver. In either case, the page source returned is a representation of the underlying DOM, and we shouldn't expect it to be formatted or escaped in the same way as the response sent from the web server.
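
One way to check how a particular driver behaves is to take both serializations at the same moment and compare them. The following is a minimal sketch, not part of the original script: the helper names snapshot_dom and same_markup are hypothetical, and driver is assumed to be a live Selenium WebDriver such as the one returned by chromedriver() in the question.

```python
import re


def snapshot_dom(driver):
    """Return the DOM as serialized by both approaches.

    `driver` is assumed to be a live Selenium WebDriver instance.
    """
    via_page_source = driver.page_source
    via_outer_html = driver.execute_script(
        "return document.documentElement.outerHTML"
    )
    return via_page_source, via_outer_html


def same_markup(a, b):
    """Compare two HTML strings while ignoring differences in whitespace runs."""
    def normalize(html):
        return re.sub(r"\s+", " ", html).strip()

    return normalize(a) == normalize(b)
```

Calling same_markup(*snapshot_dom(driver)) right after a page has finished loading, and again after its scripts have modified the page, should show whether the driver in use returns the current DOM from page_source.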


Element.outerHTML

The outerHTML property of an Element gets the serialized HTML fragment describing the element, including its descendants. Reading it returns a DOMString containing that serialization. Setting it replaces the element and all of its descendants with a new DOM tree constructed by parsing the specified htmlString. To obtain only the HTML representation of an element's contents, without the element's own tag, use the innerHTML property instead.
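
The difference between the two properties can be seen by reading both for the same element. This is a small sketch with a hypothetical helper name, element_markup. It assumes driver is a live Selenium WebDriver and element is a WebElement located on the current page.

```python
def element_markup(driver, element):
    """Return (outerHTML, innerHTML) for a WebElement, read via JavaScript.

    Selenium passes `element` into the script as arguments[0].
    """
    outer = driver.execute_script("return arguments[0].outerHTML;", element)
    inner = driver.execute_script("return arguments[0].innerHTML;", element)
    return outer, inner
```

For an element such as `<div id="mainSummary"><p>hi</p></div>`, the outer value includes the div tags themselves, while the inner value is only `<p>hi</p>`.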


Conclusion

To conclude: whereas the page source obtained from driver.page_source is more or less an artist's impression of the DOM tree, Element.outerHTML gets the serialized HTML fragment describing the element, including its descendants.
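
On the question of refreshing content: reading outerHTML through driver.execute_script does not reload anything. It only re-serializes the DOM as it stands at the moment of the call. For pages that change quickly, one option is to poll that serialization until it differs from an earlier snapshot. The sketch below illustrates the idea. The function names, the timeout and poll values, and the choice of a SHA-256 fingerprint are all assumptions made for illustration, not something taken from the original script.

```python
import hashlib
import time


def dom_fingerprint(driver):
    """Hash the current DOM serialization so snapshots are cheap to compare."""
    html = driver.execute_script("return document.documentElement.outerHTML")
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def wait_for_dom_change(driver, timeout=10.0, poll=0.5):
    """Poll until the DOM differs from its state at call time.

    Returns True if a change was seen before `timeout` seconds elapsed,
    otherwise False.
    """
    baseline = dom_fingerprint(driver)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        time.sleep(poll)
        if dom_fingerprint(driver) != baseline:
            return True
    return False
```

Once wait_for_dom_change(driver) returns True, the HTML can be read again and passed to BeautifulSoup as in pre_scraping. Any change anywhere in the document triggers it, so for a specific region it may be better to fingerprint only that element's outerHTML.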
