I am trying to use Selenium as middleware in Scrapy.
One issue is that when I use the ImagesDownloader, all my downloaded images are invalid and contain HTML. A bit of debugging led me to this:
# python3
Python 3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> u = 'https://www.gravatar.com/avatar/?s=256&d=identicon&r=PG&f=1'
>>> from selenium import webdriver
>>> driver = webdriver.Firefox()
>>> driver.get(u)
>>> driver.page_source
'<html><head><meta name="viewport" content="width=device-width; height=device-height;"><link rel="stylesheet" href="resource://content-accessible/ImageDocument.css"><link rel="stylesheet" href="resource://content-accessible/TopLevelImageDocument.css"><title>(PNG Image, 256 × 256 pixels)</title></head><body><img src="https://www.gravatar.com/avatar/?s=256&d=identicon&r=PG&f=1" alt="https://www.gravatar.com/avatar/?s=256&d=identicon&r=PG&f=1" ></body></html>'
>>>
Note that the URL in the variable u points to my avatar image, which is binary data. However, page_source contains HTML generated by Firefox itself (not sent by the remote site) to display the image in the browser.
Questions: How can I get the raw image content and how can I know if I should retrieve page_source or the raw image content?
Note: The Chrome driver has similar results.
Answer:
Selenium is not built for such tasks and has no native support for extracting raw content or protocol-level details such as HTTP headers. Selenium is a tool for automating the browser tasks a human would do, and humans seldom look at such details.
There are, however, two main ways to work around this:
- Force the browser that Selenium drives to go through an HTTP proxy. The proxy can capture protocol details and raw content, while the browser still handles dynamic content. The most prominent option seems to be selenium-wire: https://pypi.org/project/selenium-wire/
- Use Selenium's JavaScript execution feature. This gives access to much more detail, but only to what the JavaScript engine itself can reach.
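As a rough illustration of the JavaScript route, here is a minimal sketch. It asks the browser to fetch the URL again from inside the page, returns the body base64-encoded, and uses the Content-Type header to decide between page_source and the raw bytes. The names FETCH_SCRIPT, fetch_raw, is_html and get_body are made up for this example. It assumes the page is already on the same origin as the URL (true right after driver.get(url)), that the browser supports fetch, and that a second request for the same resource is acceptable.

```python
import base64

# JavaScript run inside the page. Selenium appends a callback as the last
# argument of an async script; calling it hands the result back to Python.
FETCH_SCRIPT = """
const url = arguments[0];
const done = arguments[arguments.length - 1];
fetch(url)
  .then(async (resp) => {
    const bytes = new Uint8Array(await resp.arrayBuffer());
    let binary = '';
    for (let i = 0; i < bytes.length; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    done({
      status: resp.status,
      contentType: resp.headers.get('content-type') || '',
      body: btoa(binary),
    });
  })
  .catch((err) => done({error: String(err)}));
"""


def fetch_raw(driver, url):
    """Return (content_type, raw_bytes) for url, fetched by the browser itself."""
    result = driver.execute_async_script(FETCH_SCRIPT, url)
    if "error" in result:
        raise RuntimeError(f"fetch failed for {url}: {result['error']}")
    return result["contentType"], base64.b64decode(result["body"])


def is_html(content_type):
    """True when the Content-Type header describes an HTML document."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")


def get_body(driver, url):
    """Load url, then use page_source for HTML and the raw bytes otherwise."""
    driver.get(url)
    content_type, raw = fetch_raw(driver, url)
    if is_html(content_type):
        return driver.page_source.encode("utf-8")
    return raw
```

With the session from the question, get_body(driver, u) would be expected to return the PNG bytes instead of the wrapper HTML, provided the browser allows the fetch. For real HTML pages it still returns page_source, so content rendered by JavaScript is kept.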