I am new to Scrapy and I've run into a tricky case.
Sometimes I get links like https://sitename.com/path2/?param1=value1&param2=value2
The query string is not important to me, and I want to drop it from my requests.
I mean this part of the url:
?param1=value1&param2=value2
After a day of research, I realized that this should be done in a downloader middleware in middlewares.py, because requests and responses in Scrapy pass through it.
I tried to write code that strips the query string from requests and responses, but I did not succeed.
My code does not remove the query string from requests that include one.
middlewares.py:
from w3lib.url import url_query_cleaner


class CleanUrlAgentDownloaderMiddleware:
    def process_response(self, request, response, spider):
        url_query_cleaner(response.url)
        return response

    def process_request(self, request, spider):
        url_query_cleaner(request.url)
How can I strip the query string from these requests, using w3lib.url or plain Python, so that URLs with a query string never get into Scrapy?
Note that I have already enabled my middleware class in settings.py.
CodePudding user response:
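Strings are immutable, so url_query_cleaner returns a new string instead of modifying request.url, and your code discards that return value. Nothing in the request changes. For your code to work, build a new request with the cleaned URL and return it: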
from w3lib.url import url_query_cleaner


class CleanUrlAgentDownloaderMiddleware:
    # No need for process_response since it will have the same
    # url as the request
    def process_request(self, request, spider):
        if "?" in request.url:
            return request.replace(url=url_query_cleaner(request.url))
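The same cleaning can be done with the standard library alone, which also shows why the return value has to be captured. This is only a minimal sketch: strip_query is a hypothetical helper name, not part of Scrapy or w3lib, and it drops the fragment as well as the query string.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return a copy of url without its query string and fragment.

    Hypothetical helper for illustration; not a Scrapy or w3lib function.
    """
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


url = "https://sitename.com/path2/?param1=value1&param2=value2"

strip_query(url)            # result is discarded, url itself is unchanged
cleaned = strip_query(url)  # result is kept in a new variable

print(url)
print(cleaned)
```

Inside the middleware, a helper like this could stand in for url_query_cleaner in the request.replace(url=...) call.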
Alternatively, if you want to ignore requests that have a query string in their URL, you can do:
from scrapy.exceptions import IgnoreRequest
from urllib.parse import urlparse


class IgnoreQueryRequestMiddleware:
    def process_request(self, request, spider):
        if urlparse(request.url).query:
            raise IgnoreRequest
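Whichever middleware you pick has to be enabled in settings.py, and you would normally enable only one of the two. A sketch, assuming the project package is called myproject (a hypothetical name) and using 543 as an arbitrary priority:

```python
# settings.py
# "myproject" is a placeholder for your own project package name.
# 543 is an arbitrary priority; adjust it to fit your other middlewares.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.CleanUrlAgentDownloaderMiddleware": 543,
}
```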