scrapy spider Crawled 0 pages. Is it bug with xpath or URL parameters?

Time:10-30

I'm a beginner with Scrapy; any guidance or hints are appreciated. I want to scrape the result data (say, just the item titles, for simplicity) from the following real-estate page: url = "https://www.sreality.cz/en/search/for-sale/apartments/praha?disposition=2+kt&published=month&min-floor=1&max-floor=3" where the search parameters are passed in the URL (GET method).

I tried the following basic spider:

    import scrapy
    import json

    class Sp1Spider(scrapy.Spider):
        name = 'sp1'
        allowed_domains = ['www.sreality.cz']
        start_urls = ['https://www.sreality.cz/en/search/for-sale/apartments']
        

        def parse(self, response):
            apartments = response.xpath('//basci/h2/title/@content').extract()
            yield {"apartment Text ": apartments}

However, I have not managed to scrape any data or content from the destination page above, not even the page's title!

  • First of all, I'd like to know whether I need to handle the parameters sent in the URL via the GET method myself (as is the case with the POST method), or whether they are handled automatically.

P.S. The item's title is located at the XPath '//basci/h2/title/', which contains a span with the double class "name ng-binding". I tried to work around this by scraping the whole content of that element, so I get the tag in my results, which is OK for now.

Assistance please?

CodePudding user response:

  1. First of all, the URL you put in the start_urls list is rendered dynamically (by JavaScript), but the same content is also available as static HTML.

  2. If you turn off JavaScript in the browser and refresh the page, you will notice that the URL changes.

  3. Use that changed URL as the request URL, because it is the static one.

  4. Your XPath expression was also incorrect.
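For point 4, a small offline sketch of selecting an element by its class attribute may help. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors so that it runs without a network request. The markup, the `property` and `price` classes, and the helper name `extract_titles` are made up for illustration; only the "name ng-binding" span class comes from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration only; just the "name ng-binding"
# class is taken from the question.
SAMPLE = """
<div>
  <div class="property"><span class="name ng-binding">Flat A</span></div>
  <div class="property"><span class="name ng-binding">Flat B</span></div>
  <div class="property"><span class="price">1 000 000</span></div>
</div>
"""

# An equality test on @class compares the whole attribute string,
# so both class names must appear exactly as written in the markup.
TITLE_XPATH = ".//span[@class='name ng-binding']"


def extract_titles(markup):
    """Return the text of every span whose class is exactly 'name ng-binding'."""
    root = ET.fromstring(markup.strip())
    return [span.text for span in root.findall(TITLE_XPATH)]


if __name__ == "__main__":
    print(extract_titles(SAMPLE))
```

In a spider, the corresponding call would be along the lines of response.xpath("//span[@class='name ng-binding']/text()").getall(), assuming the static page really uses that class on the title span.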

Working code as example:

    import scrapy

    class Sp1Spider(scrapy.Spider):
        name = 'sp1'
        start_urls = ['https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=']

        def parse(self, response):
            apartments = response.xpath('//*[@]')
            for apartment in apartments:
                title = apartment.xpath('.//*[@]/text()').get()

                yield {
                    'title': title
                }

Output:

{'title': 'M. Švabinského, Bílina - Teplické Předměstí'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Na Dračkách, Praha 6'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Veleslavínova, Praha - Staré Město'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Nádražní, Žlutice'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Lumírova, Praha 2 - Nusle'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Oldřichova, Praha 2 - Nusle'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Lipno nad Vltavou, district Český Krumlov'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'J. Opletala, České Budějovice - České Budějovice 2'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Ovesná, Hostivice'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Nová výstavba, Obrnice'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Plzeň - Jižní Předměstí, district Plzeň-město'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Rychnovská, Jablonec nad Nisou - Kokonín'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Vídeňská třída, Znojmo'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Brno, district Brno-město'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Čajkovského, Karviná - Mizerov'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Mečíková, Praha 10 - Záběhlice'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Ružinovská, Praha 4 - Krč'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Jagellonská, Praha 3 - Vinohrady'}
2022-10-29 19:12:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.sreality.cz/en/search/for-sale/apartments?_escaped_fragment_=>
{'title': 'Šaldova, Praha 8 - Karlín'}
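
On the GET-parameter part of the question: a GET query string is simply part of the URL, so Scrapy requests exactly the URL listed in start_urls, parameters included. Note that the start_urls entry in the question's spider leaves the search parameters out. The sketch below shows one way to assemble such a URL from a dict using only the standard library. The helper name `build_search_url` and its `static_snapshot` flag are hypothetical, and whether the `_escaped_fragment_` variant of the page honours these filters is an assumption that has not been verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sreality.cz/en/search/for-sale/apartments/praha"

# Filters copied from the URL in the question.
SEARCH_PARAMS = {
    "disposition": "2+kt",
    "published": "month",
    "min-floor": 1,
    "max-floor": 3,
}


def build_search_url(base_url, params, static_snapshot=False):
    """Append GET parameters to base_url (hypothetical helper)."""
    query = dict(params)
    if static_snapshot:
        # Assumption: the static variant accepts the same filters.
        query["_escaped_fragment_"] = ""
    # safe="+" keeps the literal "+" so the result matches the question's URL.
    return f"{base_url}?{urlencode(query, safe='+')}"


if __name__ == "__main__":
    print(build_search_url(BASE_URL, SEARCH_PARAMS))
    print(build_search_url(BASE_URL, SEARCH_PARAMS, static_snapshot=True))
```

The returned string can be placed directly in start_urls; no extra handling is needed for GET parameters, unlike a POST form submission.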
           

   