Extract text from HTML without losing the structure

Is there an easy way to extract the text from HTML source without losing its structure (specifically line breaks and spaces)?

Currently, I am extracting text as follows:

page_title_element = driver.find_element_by_xpath("x-path")
page_title = page_title_element.text

However, this method distorts the structure of the text.

I am using Python and Selenium.

Edit:

I am essentially trying to extract the complete text of the whole page, not the text of individual tags.

CodePudding user response：

You need to read the element's source, that is, its innerHTML, as you would in JavaScript. A Python WebElement has no innerHTML property of its own (and no source attribute either), so fetch it with get_attribute.

Here's how to do it:

page_title_element = driver.find_element_by_xpath("x-path")
page_title = page_title_element.get_attribute("innerHTML")
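
If the goal is text rather than markup, one option is to read the DOM's innerText or textContent property through get_attribute instead. The sketch below is a minimal illustration, assuming the element is a Selenium WebElement; the helper name element_text and its keep_source_whitespace flag are hypothetical, not part of Selenium.

```python
def element_text(element, keep_source_whitespace=False):
    """Return an element's text via a DOM property instead of WebElement.text.

    `element` is assumed to be a Selenium WebElement; any object with a
    get_attribute(name) method works. Hypothetical helper, not a Selenium API.
    """
    # textContent: the raw text nodes, with whitespace as written in the source.
    # innerText: the text roughly as rendered, with line breaks for <br> and
    # block-level elements.
    prop = "textContent" if keep_source_whitespace else "innerText"
    value = element.get_attribute(prop)
    # get_attribute returns None when the property is missing.
    return value if value is not None else ""
```

For the whole page, the same helper could be called on the body element, for example element_text(driver.find_element("xpath", "//body")) on Selenium 4. Which property suits best depends on whether rendered line breaks or source whitespace matters more.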

CodePudding user response：

Use the code below to get the markup of the whole page.

data = driver.find_element_by_xpath("//html").get_attribute("innerHTML")
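
Note that innerHTML is still markup, tags included. A minimal sketch of turning such markup into text while keeping line breaks and inner spacing follows, using only the standard library's html.parser. The names html_to_text, TextExtractor, BLOCK_TAGS and SKIP_TAGS are hypothetical, and the list of tags treated as line-breaking is a simplifying assumption, not a full model of browser rendering.

```python
from html.parser import HTMLParser

# Tags assumed to start or end a line in rendered output (partial list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose content is never shown as page text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text nodes, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and tag != "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Spaces inside text nodes are kept exactly as they appear.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.rstrip() for line in "".join(parser.parts).splitlines()]
    out = []
    for line in lines:
        # Keep at most one blank line in a row, and none at the start.
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip("\n")
```

With Selenium, the input could be the data string above or driver.page_source, for example text = html_to_text(driver.page_source).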