I'm trying to scrape a site that renders part of its content with JavaScript.
I found this project: https://github.com/scrapinghub/sample-projects/tree/master/splash_smart_proxy_manager_example, which explains quite neatly how to set things up. Here's what I have right now:
Docker compose:
version: '3.8'
services:
  scraping:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - "./scraping:/scraping"
    environment:
      - PYTHONUNBUFFERED=1
    depends_on:
      - splash
    links:
      - splash
  splash:
    image: scrapinghub/splash
    restart: always
    expose:
      - 5023
      - 8050
      - 8051
    ports:
      - "5023:5023"
      - "8050:8050"
      - "8051:8051"
spider:
from pkgutil import get_data

import scrapy
from scrapy_splash import SplashRequest


class HappySider(scrapy.Spider):
    ...
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'ITEM_PIPELINES': {
            'scraping.pipelines.HappySpiderPipeline': 300,
        },
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 522, 524, 408, 429, 403],
        'RETRY_TIMES': 20,
        'DOWNLOAD_DELAY': 5,
        'DOWNLOAD_TIMEOUT': 30,
        'CONCURRENT_REQUESTS': 1,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'COOKIES_ENABLED': False,
        'ROBOTSTXT_OBEY': True,
        # enable Zyte Proxy
        'ZYTE_SMARTPROXY_ENABLED': True,
        # the API key you get with your subscription
        'ZYTE_SMARTPROXY_APIKEY': '<my key>',
        'SPLASH_URL': 'http://splash:8050/',
    }

    def __init__(self, testing=False, name=None, **kwargs):
        # load the Lua script that routes Splash traffic through the proxy
        self.LUA_SOURCE = get_data(
            'scraping', 'scripts/smart_proxy_manager.lua'
        ).decode('utf-8')
        super().__init__(name, **kwargs)

    def start_requests(self):
        yield SplashRequest(
            url='https://www.someawesomesi.te',
            endpoint='execute',
            args={
                'lua_source': self.LUA_SOURCE,
                'crawlera_user': self.settings['ZYTE_SMARTPROXY_APIKEY'],
                'timeout': 90,
            },
            # tell Splash to cache the lua script, to avoid sending it for every request
            cache_args=['lua_source'],
            meta={
                'max_retry_times': 10,
            },
            callback=self.my_callback
        )
And the output I get is:
2022-08-10 13:09:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.someawesomesi.te via http://splash:8050/execute> (failed 1 times): 504 Gateway Time-out
I'm not sure how to proceed here. I did look into why it would return a 504, and the Splash docs do suggest some ways of handling it, but I don't have many concurrent URLs and the script fails on the very first one. Plus, the site I'm scraping is very fast, and if I just use Zyte without Splash, it scrapes very quickly.
So if anybody can suggest what's wrong here and how to fix it, I'd greatly appreciate it.
CodePudding user response:
This example did not work out of the box for me either. Changing the Zyte Smart Proxy Manager port number specified in splash_smart_proxy_manager_example/scripts/smart_proxy_manager.lua to 8010 helped:
local port = 8010
Port 8010 was used in the older version of the example.
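If you would rather not edit the Lua file by hand, the same change can be applied to the script text after it is loaded in the spider. This is only a rough sketch: override_proxy_port is a made-up helper name, and it assumes the script declares the port on a line shaped like the one quoted above (local port = <number>).

```python
import re

# Assumption: the Lua script declares the proxy port on a line shaped like
# `local port = <number>`, as in the snippet quoted in this answer.
PORT_LINE = re.compile(r"^(\s*local\s+port\s*=\s*)\d+", re.MULTILINE)


def override_proxy_port(lua_source: str, port: int = 8010) -> str:
    """Return lua_source with its first `local port = N` line set to `port`."""
    patched, count = PORT_LINE.subn(
        lambda match: f"{match.group(1)}{port}",  # keep the prefix, swap the number
        lua_source,
        count=1,
    )
    if count == 0:
        # Fail loudly rather than silently sending an unpatched script to Splash.
        raise ValueError("no `local port = <number>` line found in the Lua source")
    return patched
```

In the spider from the question this could be called right after get_data(...).decode('utf-8'), for example self.LUA_SOURCE = override_proxy_port(self.LUA_SOURCE).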
CodePudding user response:
Splash is getting deprecated soon. You can use headless browser libraries for rendering JS along with Smart Proxy Manager. Zyte recently launched three headless browser libraries.
These client libraries are built on top of the native web-automation libraries for Chromium, Firefox, and WebKit, and are written to work seamlessly with Zyte Smart Proxy Manager. Using these libraries, you no longer have to maintain a separate piece of software (like Splash) running in the background to connect to Zyte Smart Proxy Manager.
My recommendation would be to use Zyte API. Zyte API is an end-to-end API solution that executes all tasks in the web-scraping sequence. It can extract dynamically loaded page content without you spending time recreating what the browser does through JavaScript, headless browser libraries, and additional requests. Just set the javascript parameter to turn JavaScript on or off during browser rendering, and it just works.
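As a rough illustration of what such a request might look like, here is a sketch that only builds the HTTP request and does not send it. The javascript field is the parameter mentioned above; the endpoint URL, the browserHtml field, and Basic auth with the API key as the username are assumptions written from memory, so check them against the Zyte API reference before relying on them. build_zyte_request is a made-up helper name.

```python
import base64
import json
import urllib.request

# Assumptions, not taken from this thread: the endpoint URL, the `browserHtml`
# field, and HTTP Basic auth with the API key as username and empty password.
ZYTE_API_URL = "https://api.zyte.com/v1/extract"


def build_zyte_request(url: str, api_key: str, javascript: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for browser-rendered HTML."""
    payload = {"url": url, "browserHtml": True, "javascript": javascript}
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it; under the same assumptions, the rendered page would come back in the JSON response body.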
I work as a Developer Advocate @zyte.