How to use Scrapy on a website that does not change the URL when changing language

Time:04-29

As far as I can see, when the language button is pressed, this website https://www.learnit.nl/ fetches the English version by sending a POST request to https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1 and I don't know how to replicate this with Scrapy. I'd appreciate any help.

CodePudding user response:

The data comes from the JSON response of an API call made with the POST method, where the payload is a large JSON object. To replicate it with Scrapy, you can follow this example:

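import json
import scrapy


class CourseSpider(scrapy.Spider):
    name = 'course'
    # Placeholder: paste the JSON payload copied from the browser's network tab.
    # It is left empty here because the real payload is too large to include.
    body = {}

    def start_requests(self):
        # Replicate the POST request the site sends when the language button is pressed.
        yield scrapy.Request(
            url='https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1',
            callback=self.parse,
            body=json.dumps(self.body),
            method="POST",
            headers={
                # Add any request headers copied from the browser here.
            },
        )

    def parse(self, response):
        # The API returns JSON; the translated strings are under 'to_words'.
        data = json.loads(response.body)

        for word in data['to_words']:
            yield {
                'course': word
            }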

Output:

{'course': 'Writing clear texts'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML e-mail'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML and CSS Basics'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML and CSS Continued'}
2022-04-28 22:03:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1>
{'course': 'HTML Training E-learning'}

 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.879555,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 28, 16, 3, 22, 536326),
 'httpcompression/response_bytes': 36269,
 'httpcompression/response_count': 1,
 'item_scraped_count': 514,

... and so on
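
The parsing step can be tried on its own, without running a crawl. The sketch below is only an illustration: `extract_courses` is a hypothetical helper, and the sample JSON is hand-written from two course names in the output above. Only the `to_words` key comes from the answer's code.

```python
import json


def extract_courses(raw_body):
    """Turn the API's JSON response body into one item per translated string."""
    data = json.loads(raw_body)
    # .get() with a default avoids a KeyError if the response has no 'to_words'.
    return [{'course': word} for word in data.get('to_words', [])]


# Hand-written sample shaped like the response the spider parses (not real API output).
sample_body = json.dumps({'to_words': ['Writing clear texts', 'HTML e-mail']})
items = extract_courses(sample_body)
print(items)
```

Inside the spider, `parse` could then be reduced to `yield from extract_courses(response.body)`.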

The payload is a large JSON object that exceeds the post length limit, so it is not included here. Full working code here
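
One way to handle a payload that is too large to paste into the spider is to save it to a file from the browser's network tab and load it at run time. The following is a minimal sketch under that assumption. The file name `payload.json`, the helper `build_request_kwargs`, and the `Content-Type` header are illustrative choices and not part of the original answer.

```python
import json

API_URL = 'https://cdn-api-weglot.com/translate?api_key=wg_6199f2422428fc4285eb776a1ab915c08&v=1'


def build_request_kwargs(payload_path):
    """Read a saved JSON payload and return keyword arguments for scrapy.Request."""
    with open(payload_path, encoding='utf-8') as f:
        payload = json.load(f)  # fails early if the saved file is not valid JSON
    return {
        'url': API_URL,
        'method': 'POST',
        'body': json.dumps(payload),
        # Assumption: the endpoint expects a JSON content type.
        'headers': {'Content-Type': 'application/json'},
    }
```

In `start_requests`, this could be used as `yield scrapy.Request(callback=self.parse, **build_request_kwargs('payload.json'))`.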
