Extract text from HTML without losing the structure

Time:07-05

Is there an easy way to extract the text from HTML source without losing its structure (specifically line breaks and spaces)?

Currently, I am extracting text as follows:

page_title_element = driver.find_element_by_xpath("x-path")
page_title = page_title_element.text

However, this method distorts the structure of the text.

I am using Python and Selenium.

Edit:

I am essentially trying to extract the text of the whole page (the complete text content of the HTML), not the text of individual tags.

CodePudding user response:

You need to read the element's HTML source rather than its rendered text. In JavaScript this is the innerHTML property. A Selenium WebElement in Python does not expose it as a direct attribute, but get_attribute("innerHTML") returns it.

Here's how to do it:

page_title_element = driver.find_element_by_xpath("x-path")
page_title = page_title_element.get_attribute("innerHTML")

CodePudding user response:

Use the following code to get the HTML of the whole page:

data = driver.find_element_by_xpath("//html").get_attribute("innerHTML")
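
The innerHTML string still contains markup, so it has to be turned back into text. Below is a minimal sketch of one way to do that with the standard library's html.parser, assuming block-level tags and <br> should become line breaks and that script/style content should be dropped. The names BLOCK_TAGS, SKIP_TAGS, TextExtractor and html_to_text are made up for this example, and the tag lists are a guess you would adjust for your pages.

```python
import re
from html.parser import HTMLParser

# Assumption: tags that should start/end a line in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "pre",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: tags whose content is not visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text nodes and inserts newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and tag != "br":
            self._chunks.append("\n")

    def handle_data(self, data):
        # Keep text exactly as written in the source, including spaces.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        raw = "".join(self._chunks)
        # Collapse runs of blank lines into a single blank line.
        return re.sub(r"\n\s*\n", "\n\n", raw).strip()


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With Selenium you would pass in the string from get_attribute("innerHTML"), e.g. html_to_text(data). Note that this keeps spaces as they appear in the HTML source, which is not necessarily how a browser would render them.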