Extracting <li> and <ul> using Scrapy

Time:06-24

I'm new to Scrapy and I'm having trouble forming an accurate selector. Starting from Scrapy's tutorial code, I'm trying to list all businesses together with their address and website. But when I try to list them, only one result comes out. (If I switch all of the fields to getall(), I do get every value, but they come out jumbled together.) I need them in this format:

{"address": "mazowieckie, Warszawa", "name": "Dom Development S.A.", "link": "domd.pl"}

Here is the code that I use:

import scrapy


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1',
    ]

    def parse(self, response):
        for quote in response.css('ul.rp-1qtpzi4'):
            yield {
                'address': quote.css('address.rp-o9b83y::text').get(),
                'name': quote.css('h2.rp-69f2r4::text').get(),
                'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
            }

Thanks in advance.

CodePudding user response:

You are getting only one result because the selector ul.rp-1qtpzi4 is incorrect: it does not select each listing on the page. A selector such as .rp-y89gny.eboilu01 ul li selects all 24 items:

import scrapy
from scrapy.crawler import CrawlerProcess

class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1']

    def parse(self, response):
        for quote in response.css('.rp-y89gny.eboilu01 ul li'):
            yield {
                'address': quote.css('address.rp-o9b83y::text').get(),
                'name': quote.css('h2.rp-69f2r4::text').get(),
                'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
            }

    
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(RynekMainSpider)
    process.start()

Output:

{'address': 'mazowieckie, Warszawa', 'name': 'Dom Development S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/dom-development-sa-955/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Ronson Development Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/ronson-development-sp-z-oo-863/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Echo Investment S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/echo-investment-sa-7478/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Psie Pole', 'name': 'INTER-ES Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/inter-es-deweloper-928/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, Bielsko-Biała', 'name': 'Murapol S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/murapol-sa-884/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Robyg S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-sa-888/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'śląskie, cieszyński, Cieszyn', 'name': 'ATAL S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/atal-sa-1084/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'zachodniopomorskie, Szczecin', 'name': 'Assethome – Przedstawiciel Dewelopera', 'link': 'https://rynekpierwotny.pl/deweloperzy/asset-home-przedstawiciel-dewelopera-7429/'}    
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Hreit', 'link': 'https://rynekpierwotny.pl/deweloperzy/hreit-7892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Develia', 'link': 'https://rynekpierwotny.pl/deweloperzy/develia-1048/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław, Fabryczna', 'name': 'PROFIT Development', 'link': 'https://rynekpierwotny.pl/deweloperzy/profit-development-940/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Novisa Development Sp. z o.o. Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/novisa-development-sp-z-oo-sp-j-484/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Robyg', 'link': 'https://rynekpierwotny.pl/deweloperzy/robyg-grupa-deweloperska-4251/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Arche S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/arche-sp-z-oo-934/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'warmińsko-mazurskie, ełcki, Ełk', 'name': 'Rutkowski Development Sp. J.', 'link': 'https://rynekpierwotny.pl/deweloperzy/rutkowski-development-sp-j-1846/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Cordia Polska', 'link': 'https://rynekpierwotny.pl/deweloperzy/cordia-polska-3824/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Budlex Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/budlex-sp-z-oo-1684/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'pomorskie, Gdańsk', 'name': 'Euro Styl S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/euro-styl-sa-964/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'łódzkie, Skierniewice', 'name': 'JHM DEVELOPMENT S.A.', 'link': 'https://rynekpierwotny.pl/deweloperzy/jhm-development-sa-892/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'dolnośląskie, Wrocław', 'name': 'Lokum Deweloper', 'link': 'https://rynekpierwotny.pl/deweloperzy/lokum-deweloper-948/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'podlaskie, Łomża', 'name': 'Eldor Bud Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/eldor-bud-sp-z-oo-4355/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Nexity Polska Sp. z o.o.', 'link': 'https://rynekpierwotny.pl/deweloperzy/nexity-polska-sp-z-oo-2856/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'mazowieckie, Warszawa', 'name': 'Spravia', 'link': 'https://rynekpierwotny.pl/deweloperzy/spravia-1236/'}
2022-06-24 02:16:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rynekpierwotny.pl/deweloperzy/>
{'address': 'małopolskie, Kraków', 'name': 'Bryksy', 'link': 'https://rynekpierwotny.pl/deweloperzy/bryksy-914/'}

'item_scraped_count': 24,

CodePudding user response:

response.css('ul.rp-1qtpzi4') returns the container of the items, not the items (the li tags) themselves. So you're looping over the container once, and because .get() returns only the first match, you get just the first item.

Change it to:

import scrapy


class RynekMainSpider(scrapy.Spider):
    name = "RynekMain"
    start_urls = [
        'https://rynekpierwotny.pl/deweloperzy/?page=1',
    ]

    def parse(self, response):
        for quote in response.css('ul.rp-1qtpzi4 li'):
            yield {
                'address': quote.css('address.rp-o9b83y::text').get(),
                'name': quote.css('h2.rp-69f2r4::text').get(),
                'link': quote.css('li.rp-np9kb1 a::attr(href)').get(),
            }