Drop requests that include a query string in Scrapy

Time:08-19

I am new to Scrapy and I've come across a complicated case.

My problem is that sometimes I have links like https://sitename.com/path2/?param1=value1&param2=value2. For me the query string is not important, and I want to drop it from requests.
I mean this part of the URL: ?param1=value1&param2=value2

After a day of research, I realized that this should be done in a downloader middleware in middlewares.py, because requests and responses in Scrapy pass through it.
I tried to write code so that the requests and responses have no query string, but I did not succeed.
My code does not drop requests that include a query string.
middlewares.py:

from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:

    def process_response(self, request, response, spider):
        url_query_cleaner(response.url)
        return response

    def process_request(self, request, spider):
        url_query_cleaner(request.url)

How can I drop these requests, using the w3lib.url library or plain Python, so that they never enter Scrapy?
Note that I have already enabled my middleware class in settings.py.

CodePudding user response:

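Strings in Python are immutable, so url_query_cleaner returns a new string instead of modifying request.url, and your code discards that return value. As a result, nothing in the request changes. For your code to work, return a copy of the request with the cleaned URL: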

from w3lib.url import url_query_cleaner

class CleanUrlAgentDownloaderMiddleware:
    # No need for process_response, since the response will have
    # the same URL as the request.

    def process_request(self, request, spider):
        if "?" in request.url:
            return request.replace(url=url_query_cleaner(request.url))
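
To see why the return value matters, here is a minimal sketch in plain Python, using only the standard library. The helper name strip_query is hypothetical, and it is a stand-in for url_query_cleaner, not its actual implementation. It also drops the fragment, which is an assumption about what is wanted here.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return a copy of `url` without its query string and fragment."""
    parts = urlsplit(url)
    # _replace builds a new SplitResult; the original string is untouched.
    return urlunsplit(parts._replace(query="", fragment=""))


original = "https://sitename.com/path2/?param1=value1&param2=value2"
cleaned = strip_query(original)

print(cleaned)   # the URL without the query string
print(original)  # unchanged: the query string is still there
```

Calling the function without using its result, as in the question's middleware, leaves the original URL exactly as it was.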

Alternatively, if you want to ignore requests that have a query string in their URL, you can raise IgnoreRequest:

from scrapy.exceptions import IgnoreRequest
from urllib.parse import urlparse

class IgnoreQueryRequestMiddleware:
    def process_request(self, request, spider):
        if urlparse(request.url).query:
            raise IgnoreRequest
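
The check used by this middleware can be tried outside Scrapy. The sketch below is only an illustration: has_query is a hypothetical helper, and the last two URLs are made-up edge cases. It compares the urlparse-based test with the simpler "?" in url test from the first middleware.

```python
from urllib.parse import urlparse


def has_query(url: str) -> bool:
    """True when the URL carries a non-empty query string."""
    return bool(urlparse(url).query)


urls = [
    "https://sitename.com/path2/?param1=value1&param2=value2",
    "https://sitename.com/path2/",
    "https://sitename.com/path2/?",         # bare "?", empty query
    "https://sitename.com/path2/#top?x=1",  # "?" inside the fragment
]

for url in urls:
    # Columns: urlparse-based check, substring check, URL
    print(has_query(url), "?" in url, url)
```

The two checks agree on the first two URLs and differ on the last two: a bare "?" or a "?" inside the fragment does not count as a query for urlparse, so such requests would not be ignored, while the substring check would still match them.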