Extract/Save Element's Text AND Images with Selenium

Time:05-17

Working with Windows, Python 3 and Selenium/Chromedriver, I'm trying to figure out a way to save an element's data (text AND images) to an offline file for later viewing. Things I've tried:

1. Save page source to .html file

    page_source = driver.page_source
    with open("page.html", "w", encoding="utf-8") as file:
        file.write(page_source)

The problem is that this saves only the page's markup, not the images: the saved page renders empty image placeholders instead of the actual images.

2. Take screenshots of the entire page

    page_width = driver.execute_script('return document.body.scrollWidth')
    page_height = driver.execute_script('return document.body.scrollHeight')
    driver.set_window_size(page_width, page_height)
    driver.save_screenshot("page.png")

The problem here is that, even though I set the window to the full page width and height, only the visible section of the page is captured, so scrolling would need to be incorporated.

3. Use a "select all" type logic taken from this answer

This is a hacky workaround that could work, but I'm looking for a better solution.

4. Press CTRL+S to save the page and its assets for offline viewing

This was OK, but it downloads everything needed to render the entire page into a separate folder, which seems unnecessary since I only want the content of one element on the page. I'll also be downloading several pages, and I don't want a separate assets folder for each one.
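
For attempt 2, a likely reason only the visible section is captured is that a non-headless browser window cannot be resized beyond the physical screen. A minimal sketch of one alternative, assuming a Chromium-based driver that exposes execute_cdp_cmd: ask Chrome DevTools for the full content size and capture beyond the viewport. The helper name save_full_page_screenshot is hypothetical, and captureBeyondViewport is marked experimental in the DevTools protocol, so behaviour may vary between Chrome versions.

```python
import base64


def save_full_page_screenshot(driver, path: str) -> None:
    """Capture the whole document, not just the viewport, via Chrome DevTools."""
    # Page.getLayoutMetrics reports the full content size in CSS pixels.
    metrics = driver.execute_cdp_cmd("Page.getLayoutMetrics", {})
    # Older Chrome versions only report "contentSize".
    size = metrics.get("cssContentSize") or metrics["contentSize"]
    result = driver.execute_cdp_cmd("Page.captureScreenshot", {
        "format": "png",
        "captureBeyondViewport": True,  # experimental DevTools parameter
        "clip": {"x": 0, "y": 0,
                 "width": size["width"], "height": size["height"],
                 "scale": 1},
    })
    # The screenshot comes back base64-encoded in the "data" field.
    with open(path, "wb") as file:
        file.write(base64.b64decode(result["data"]))


# Possible usage with the existing session:
#     save_full_page_screenshot(driver, "page.png")
```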

So I'm wondering if there's a better way to save the text AND images of a page element, preferably to an HTML, DOCX, or PDF file? I've seen various solutions on SO, but haven't found one that does this, so I'm looking for a steer in the right direction. Thanks!
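
One possible direction for the HTML case, sketched under assumptions: take the element's outerHTML (for example via element.get_attribute("outerHTML")) and replace each img src with a base64 data URI, so that a single .html file carries both text and images. The names inline_images and fetch_bytes are hypothetical. The download uses plain urllib without the browser's cookies, so images behind a login would need a different fetcher, and srcset attributes and CSS background images are not handled.

```python
import base64
import mimetypes
import re
from html import unescape
from typing import Callable
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Download a URL with the standard library (no browser cookies)."""
    with urlopen(url) as response:
        return response.read()


def inline_images(markup: str, base_url: str,
                  fetch: Callable[[str], bytes] = fetch_bytes) -> str:
    """Replace every <img src="..."> with a base64 data URI."""
    # Matches the src attribute of an <img> tag, quoted with " or '.
    pattern = re.compile(r'(<img\b[^>]*?\ssrc=)(["\'])(.*?)\2',
                         re.IGNORECASE | re.DOTALL)

    def replace(match):
        prefix, quote, src = match.groups()
        if src.startswith("data:"):
            return match.group(0)  # already embedded, leave untouched
        # Resolve relative URLs against the page the element came from.
        absolute = urljoin(base_url, unescape(src))
        mime = (mimetypes.guess_type(urlparse(absolute).path)[0]
                or "application/octet-stream")
        encoded = base64.b64encode(fetch(absolute)).decode("ascii")
        return f"{prefix}{quote}data:{mime};base64,{encoded}{quote}"

    return pattern.sub(replace, markup)


# Possible usage with the existing session:
#     fragment = element.get_attribute("outerHTML")
#     with open("element.html", "w", encoding="utf-8") as file:
#         file.write(inline_images(fragment, driver.current_url))
```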

CodePudding user response:

This code takes a screenshot of the element #mp-topbanner and prints the text it contains:

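from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://en.wikipedia.org/wiki/Main_Page')
element = driver.find_element(By.CSS_SELECTOR, '#mp-topbanner')
element.screenshot('screen.png')  # saves a PNG of just this element
print(element.text)  # visible text of the element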

output

Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
6,498,975 articles in English
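
Building on the snippet above, the screenshot and the text could be combined into one offline file. A minimal sketch, assuming a Selenium WebElement (which exposes screenshot_as_base64 and text); the helper name element_snapshot_html is hypothetical. The image is embedded as a data URI, so no separate assets folder is needed, though the result is a picture of the element plus its text rather than its original markup.

```python
import html


def element_snapshot_html(element) -> str:
    """Build a self-contained HTML page from an element's screenshot and text."""
    png_base64 = element.screenshot_as_base64  # PNG of the element, base64-encoded
    text = html.escape(element.text)           # escape <, > and & in the visible text
    return (
        "<!DOCTYPE html>\n"
        "<html><head><meta charset=\"utf-8\"></head><body>\n"
        f"<img src=\"data:image/png;base64,{png_base64}\" alt=\"element screenshot\">\n"
        f"<pre>{text}</pre>\n"
        "</body></html>\n"
    )


# Possible usage with the element found above:
#     with open("element.html", "w", encoding="utf-8") as file:
#         file.write(element_snapshot_html(element))
```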

CodePudding user response:

I ended up going with CTRL+A, CTRL+C to copy all the text and images from the entire page. I then use win32clipboard to read the data from the clipboard and export it to a docx file. It's kind of hacky, and it doesn't capture only the element's content, but it works for my purposes. Maybe someone will have a better solution in the future.

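from time import sleep

import win32clipboard
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

a = ActionChains(driver)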
a.key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform()
sleep(1)
a.key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform()

current_books_pages_list = []
while True:
    try:
        win32clipboard.OpenClipboard()
        try:
            data = win32clipboard.GetClipboardData()
        finally:
            # Always release the clipboard, even if reading fails
            win32clipboard.CloseClipboard()
        current_books_pages_list.append(data)
        print("Woohoo!")
        break
    except Exception:
        # The clipboard may be temporarily locked by another process
        print("Clipboard access denied error, trying again...")
        sleep(1)

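# Append the copied data to the book's file (written as plain text, despite the .docx extension)
with open("books/" + str(book_name_string) + ".docx", "a", encoding="utf-8") as file: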
    for page_data in current_books_pages_list:
        file.write(page_data)
        file.write("\n")
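
If the clipboard route is kept, the markup of the selection (rather than plain text) may also be available. A minimal sketch, assuming the browser places an "HTML Format" entry on the Windows clipboard and that the payload follows the documented CF_HTML layout (an ASCII header with byte offsets, followed by UTF-8 HTML); the helper name extract_html_fragment is hypothetical. Images in the fragment are still referenced by URL rather than embedded.

```python
import re


def extract_html_fragment(raw: bytes) -> str:
    """Return the copied fragment from a Windows "HTML Format" clipboard payload."""

    def offset(name: str) -> int:
        # Header lines look like "StartFragment:00000123" (byte offsets).
        match = re.search(rb"%s:(\d+)" % name.encode("ascii"), raw)
        if match is None:
            raise ValueError(f"missing {name} in clipboard header")
        return int(match.group(1))

    fragment = raw[offset("StartFragment"):offset("EndFragment")]
    return fragment.decode("utf-8", errors="replace")


# Possible usage on Windows with pywin32 (assumes the format is present):
#     fmt = win32clipboard.RegisterClipboardFormat("HTML Format")
#     win32clipboard.OpenClipboard()
#     try:
#         raw = win32clipboard.GetClipboardData(fmt)
#     finally:
#         win32clipboard.CloseClipboard()
#     fragment_html = extract_html_fragment(raw)
```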