Identify and save image content in selenium python3

Time:08-26

I am trying to use Selenium as middleware in Scrapy.

One issue is that when I use the ImagesDownloader, all my downloaded images are invalid and contain HTML instead of image data. A bit of debugging led me to this:

# python3
Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u = 'https://www.gravatar.com/avatar/?s=256&d=identicon&r=PG&f=1'
>>> from selenium import webdriver
>>> driver = webdriver.Firefox()
>>> driver.get(u)
>>> driver.page_source
'<html><head><meta name="viewport" content="width=device-width; height=device-height;"><link rel="stylesheet" href="resource://content-accessible/ImageDocument.css"><link rel="stylesheet" href="resource://content-accessible/TopLevelImageDocument.css"><title>(PNG Image, 256&nbsp;×&nbsp;256 pixels)</title></head><body><img src="https://www.gravatar.com/avatar/?s=256&amp;d=identicon&amp;r=PG&amp;f=1" alt="https://www.gravatar.com/avatar/?s=256&amp;d=identicon&amp;r=PG&amp;f=1" ></body></html>'
>>> 

Note that the URL in the variable u points to my avatar, which is a binary image. However, page_source returns HTML that Firefox itself (not Stack Overflow) generates in order to display the image in the browser.

Questions: How can I get the raw image content, and how can I tell whether I should retrieve page_source or the raw image content?

Note: The Chrome driver has similar results.

Answer:

Selenium is not built for this kind of task: it has no native support for extracting raw response content or protocol details such as HTTP headers. Selenium is a tool for automating what a human would do in a browser, and humans seldom look at such details.

There are, however, two main ways to try to solve this issue.

  1. Force the browser that Selenium drives to go through an HTTP proxy. The proxy can capture protocol details and the raw content, while the browser still handles dynamic content. The most prominent option seems to be selenium-wire: https://pypi.org/project/selenium-wire/
  2. Use Selenium's JavaScript execution feature. This exposes much more detail, but only what the JavaScript engine has access to.
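
A rough sketch of both options follows. The helper names (fetch_via_wire, fetch_via_js, parse_data_url, is_image) are made up for illustration and are not part of Selenium or selenium-wire. It assumes the driver has already loaded the URL, so that the in-browser fetch() is a same-origin request, and for option 1 that the driver was created from seleniumwire.webdriver rather than selenium.webdriver. Treat it as a starting point, not a tested solution for every browser.

```python
import base64

# JavaScript executed inside the browser: re-fetch the URL and hand the body
# back to Python as a base64 data URL. execute_async_script() appends a
# callback as the last argument; calling it returns the value to Python.
_FETCH_JS = """
const url = arguments[0];
const done = arguments[arguments.length - 1];
fetch(url)
  .then(resp => resp.blob())
  .then(blob => {
    const reader = new FileReader();
    reader.onloadend = () => done(reader.result);
    reader.readAsDataURL(blob);
  })
  .catch(err => done('ERROR: ' + err));
"""


def parse_data_url(data_url):
    """Split 'data:<mime>;base64,<payload>' into (mime, raw_bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("unexpected result from browser: %r" % data_url[:60])
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)


def fetch_via_js(driver, url):
    """Option 2: return (mime, raw_bytes) using fetch() inside the browser."""
    result = driver.execute_async_script(_FETCH_JS, url)
    return parse_data_url(result)


def fetch_via_wire(driver, url):
    """Option 1: return (mime, raw_bytes) from selenium-wire's captured traffic.

    Assumes driver is a seleniumwire.webdriver instance that already loaded
    url. The body is returned as captured; if the server sent a
    Content-Encoding, it may still need decoding.
    """
    for request in driver.requests:
        if request.url == url and request.response is not None:
            content_type = request.response.headers.get("Content-Type", "")
            return content_type.split(";")[0].strip(), request.response.body
    raise LookupError("no captured response for %s" % url)


def is_image(mime):
    """Decide whether to save the raw bytes or fall back to page_source."""
    return mime.startswith("image/")
```

Typical use would be to call driver.get(u), then mime, data = fetch_via_js(driver, u), and write data to a file in binary mode when is_image(mime) is true, falling back to driver.page_source otherwise. The content type is what answers the second question: an image/* type means the page_source is only the browser's wrapper document.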