Getting error when sending request to a website using Scrapy shell

Time:12-01

I am learning the Scrapy framework and tried to use the Scrapy shell to fetch a response from "https://quotes.toscrape.com/". The commands are below-

python -m scrapy shell

Inside the shell-

>>> from scrapy import Request
>>> req = Request("https://quotes.toscrape.com/")
>>> fetch(req)

Then I got an error like this-

PS D:\Projects\scrapyLearn\introSpider\introSpider> python -m scrapy shell
2022-11-30 15:04:52 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: introSpider)
2022-11-30 15:04:52 [scrapy.utils.log] INFO: Versions: lxml 4.9.0.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.0, Twisted 22.10.0, Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.22000-SP0
2022-11-30 15:04:52 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'introSpider',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'introSpider.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['introSpider.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2022-11-30 15:04:52 [asyncio] DEBUG: Using selector: SelectSelector
2022-11-30 15:04:52 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2022-11-30 15:04:52 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2022-11-30 15:04:52 [scrapy.extensions.telnet] INFO: Telnet Password: 9ec5c326bbb22c54
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-30 15:04:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-30 15:04:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002601B1B48D0>
[s]   item       {}
[s]   settings   <scrapy.settings.Settings object at 0x000002601B3EC550>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> from scrapy import Request
>>> req = Request("https://quotes.toscrape.com/")
>>> fetch(req)
2022-11-30 15:05:46 [scrapy.core.engine] INFO: Spider opened
2022-11-30 15:05:47 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://quotes.toscrape.com/robots.txt> (referer: None)
2022-11-30 15:05:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/> (referer: None)
>>> 2022-11-30 15:05:47 [scrapy.core.scraper] ERROR: Spider error processing <GET https://quotes.toscrape.com/> (referer: None)
Traceback (most recent call last):
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\utils\defer.py", line 285, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\scrapy\utils\defer.py", line 272, in deferred_from_coro
    event_loop = get_asyncio_event_loop_policy().get_event_loop()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\asyncio\events.py", line 677, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'Thread-1 (start)'.
2022-11-30 15:05:47 [py.warnings] WARNING: C:\Users\arnoLiono\AppData\Local\Programs\Python\Python311\Lib\site-packages\twisted\internet\defer.py:892: RuntimeWarning: coroutine 'SpiderMiddlewareManager.scrape_response.<locals>.process_callback_output' was never awaited
  current.result = callback(  # type: ignore[misc]


The shell is still running afterwards. I don't know what the error is or how to fix it.

I was just trying to get the response from "https://quotes.toscrape.com/" website.

CodePudding user response:

If you are using Windows, this is caused by a bug.

Here is the GitHub issue.

This has absolutely nothing to do with the robots.txt file.
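The RuntimeError in the traceback can be reproduced outside Scrapy. The sketch below is only an illustration of the message, not Scrapy's actual code: it assumes the failure comes from an asyncio event loop lookup running in a worker thread (the traceback names 'Thread-1 (start)'), where asyncio has no current loop. The names look_up_loop and captured are made up for the example.

```python
import asyncio
import threading

captured = {}  # hypothetical holder for whatever the worker thread raises


def look_up_loop():
    # Runs in a worker thread, similar to 'Thread-1 (start)' in the traceback.
    # No event loop has been set for this thread, so the lookup fails.
    try:
        asyncio.get_event_loop()
    except RuntimeError as exc:
        captured["error"] = exc


worker = threading.Thread(target=look_up_loop)
worker.start()
worker.join()

# Show the exception type and message collected from the worker thread.
print(type(captured.get("error")).__name__, captured.get("error"))
```

If that is what happens in your shell, one workaround to try (my assumption, not something taken from the issue) is commenting out the TWISTED_REACTOR line in settings.py so the shell falls back to the default reactor, or upgrading Scrapy to a newer release.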

CodePudding user response:

I reproduced the same steps and had no problem getting the page. I would recommend changing this setting in settings.py: ROBOTSTXT_OBEY = False, because, as you can see in the logs, Scrapy receives a 404 (error) on its first request, to https://quotes.toscrape.com/robots.txt, which doesn't exist.
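For reference, the change would look roughly like this in the project's settings.py. This is a minimal fragment and assumes it replaces the existing ROBOTSTXT_OBEY = True line that shows up under "Overridden settings" in your log; the comments are mine.

```python
# settings.py (project settings module)
# Assumption: this replaces the existing ROBOTSTXT_OBEY = True line.
ROBOTSTXT_OBEY = False  # do not request or obey robots.txt before crawling
```

Restart the shell after editing so the new setting is picked up.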

I would also recommend calling fetch directly with the URL as an argument, for example: fetch("https://quotes.toscrape.com/")
