Preserving formatting (\t) in scraped text - Python Selenium

Time:06-30

I have a program that takes the text from a website using the following code:

from selenium import webdriver

driver = webdriver.Chrome(executable_path=r"\chromedriver.exe")

def get_raw_input(link_input, website_input, driver):
    driver.get(website_input)
    try:
        # Follow the "here" link, then read the <pre> block
        here_button = driver.find_element_by_xpath('/html/body/div[2]/h3/a')
        here_button.click()
        raw_data = driver.find_element_by_xpath('/html/body/pre').text
    except Exception:
        # Fall back to polling for the 'output' element until it appears
        move_on = False
        while not move_on:
            try:
                raw_data = driver.find_element_by_class_name('output').text
                move_on = True  # assignment, not the comparison `==`
            except Exception:
                pass
    driver.close()
    return raw_data

The section of text it is targeting is formatted like so:

englishword tab frenchword

However, the text I get back is in this format:

englishword space frenchword

The English part of the text could be a phrase with spaces in it, so I cannot simply .split(" ") since that may split the phrase as well.

My end goal is to keep the tab instead of a space so that I can .split("\t"), which makes later manipulation easier.

Any help would be greatly appreciated :)

CodePudding user response:

Selenium's .text returns an element's text the way the browser renders it, so it typically "normalizes" whitespace: runs of inner whitespace characters, tabs included, are collapsed into a single space.

You can see some discussion here. The solution suggested by the Selenium maintainers for getting the text with its original spacing is to query the element's textContent property.

Here is the example:

raw_data = driver.find_element_by_class_name('output').get_property('textContent')
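
Once raw_data keeps its tabs, the split the question is after could look something like the sketch below. The helper name parse_pairs and the sample string are made up for illustration, and it assumes each line holds exactly one English entry and one French entry separated by a tab.

```python
def parse_pairs(raw_text):
    """Split tab-separated lines into (english, french) tuples."""
    pairs = []
    for line in raw_text.splitlines():
        if "\t" not in line:
            continue  # skip blank lines and lines without a tab
        # maxsplit=1 splits on the first tab only; spaces inside phrases are untouched
        english, french = line.split("\t", 1)
        pairs.append((english.strip(), french.strip()))
    return pairs


# Hypothetical sample standing in for the scraped textContent
sample = "good morning\tbonjour\nthank you very much\tmerci beaucoup\n"
print(parse_pairs(sample))
```

Splitting on "\t" rather than " " is what keeps multi-word phrases such as "thank you very much" in one piece.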