Scraping results from multiple pages into one item using Scrapy

How can I scrape results from multiple pages into one item using Scrapy?

The pages that should be considered:

an original page o (e.g. given by start_requests())
every page url in urls, where urls is a field produced by scraping o in parse().

Note that urls for different o might not be disjoint.

Specific example

I have a spider that yields the following fields for an item `i`, i.e. for a scraped page:

id
prio
urls

urls is a list of urls, and for each url (that is not dead) I want to scrape some information from that url to extend i with the fields:

image_list
head_list

Finally, I want to filter the resulting items so that for each id, only the item with the highest prio is kept.
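
To make the filtering requirement concrete, here is a minimal sketch in plain Python. It assumes each item is a dict with `id` and `prio` keys; the function name `keep_highest_prio` and the sample data are hypothetical and not part of the spider. On equal prio it keeps the item seen first, which is also an assumption.

```python
def keep_highest_prio(items):
    """Return one item per id, keeping the item with the largest prio."""
    best = {}
    for item in items:
        current = best.get(item["id"])
        # Replace only on a strictly higher prio, so ties keep the first item seen.
        if current is None or item["prio"] > current["prio"]:
            best[item["id"]] = item
    return list(best.values())


# Hypothetical sample items, for illustration only.
items = [
    {"id": "a", "prio": 1, "urls": ["https://example.com/1"]},
    {"id": "a", "prio": 3, "urls": ["https://example.com/2"]},
    {"id": "b", "prio": 2, "urls": []},
]
filtered = keep_highest_prio(items)
```

Because the result depends on all items, a step like this can only run once every item has been scraped, for example on the exported feed.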

What I have tried

Since I have read that all scraping should be done inside a spider (as opposed to, e.g., inside an item pipeline component), I thought the best approach would be to separate the scraping from the post-processing as follows:

use a spider that collects all data from a start page, parses the data via parse into i, and then calls response.follow(url, callback=self.parse_given_url, meta={'item':i}) for each url in i's urls
parse_given_url retrieves i from response.meta, parses the given url, and adds image_list and head_list to i
do all post-processing (merge and filter) on the scraped data via item pipeline components to get the final items.

A minimal reproducible example of my approach:

import scrapy

class Minimal(scrapy.Spider):
    name = "minimal"

    def start_requests(self):
        url = 'https://www.arztsuche-bw.de/index.php?suchen=1&id_fachgruppe=441&arztgruppe=facharzt&plz=761&direction=ASC'
        yield scrapy.Request(url=url, method="POST", callback=self.parse)

    def parse(self, response):
        for office in response.css('li.row.resultrow.even') + response.css('li.row.resultrow.odd'):
            full_name = office.css('dd.name dl').xpath('string(.//dt[1])').get()
            contact_selectors = office.css('dd.adresse dl dd')
            urls = contact_selectors.xpath('.//a[@title="Homepage aufrufen"]/@href').getall()
            office_data = {
                'name': full_name,
                'url': urls,
            }
            if urls:
                for url in urls:
                    yield response.follow(url, callback=self.parse_hp, meta={'item':office_data})
            else:
                yield office_data

    def parse_hp(self, response):
        office_data = response.meta['item']

        return {
            **office_data,
            'hp_head': response.xpath('//h1/text()').get(),
            'hp_logo_image': response.xpath('//img/@src').get(),
        }

However, since the urls fields of different items are not disjoint, Scrapy's duplicate filter drops some of the requests created by response.follow(), so the corresponding items are missing from the output. I could pass dont_filter=True to response.follow(), but then a url might be scraped multiple times, which I would like to avoid. Thus I have the feeling that my approach is not the right one.
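
One way to picture the goal of scraping each url once and still extending every item that lists it is sketched below in plain Python. This is only an illustration under assumptions: the names `unique_urls`, `merge_page_data` and `page_data` are hypothetical, and `page_data` stands in for whatever a per-url callback would extract (dead urls are simply absent from it).

```python
def unique_urls(items):
    """Collect every url found in any item's urls, once each, in first-seen order."""
    seen = {}
    for item in items:
        for url in item["urls"]:
            seen.setdefault(url, None)
    return list(seen)


def merge_page_data(items, page_data):
    """Extend each item with image_list and head_list taken from per-url results.

    page_data maps url -> {"images": [...], "heads": [...]}; urls without an
    entry (e.g. dead links) contribute nothing.
    """
    merged = []
    for item in items:
        pages = [page_data[u] for u in item["urls"] if u in page_data]
        merged.append({
            **item,
            "image_list": [img for p in pages for img in p["images"]],
            "head_list": [h for p in pages for h in p["heads"]],
        })
    return merged


# Hypothetical sample data: both items share https://b.example.
items = [
    {"id": 1, "prio": 1, "urls": ["https://a.example", "https://b.example"]},
    {"id": 2, "prio": 5, "urls": ["https://b.example", "https://dead.example"]},
]
to_fetch = unique_urls(items)  # each shared url appears only once
page_data = {
    "https://a.example": {"images": ["a.png"], "heads": ["A"]},
    "https://b.example": {"images": ["b.png"], "heads": ["B"]},
}
result = merge_page_data(items, page_data)
```

Keying the scraped data by url means a shared url is requested once, and the join back onto the items happens afterwards.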

Answer:

To combine info from the main website with info picked from the individual clinics' websites, you can do the following (EDIT: included custom_settings, as well as a redirection to 'google.com' for the entries without a website; it now yields 56 results out of 63 and needs further debugging):

import scrapy
from german_medical.items import GermanMedicalItem

class DoctorsSpider(scrapy.Spider):
    name = 'doctors'
    custom_settings = {
        'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
    }
    allowed_domains = []
    start_urls = ['https://www.arztsuche-bw.de/index.php?suchen=1&offset=0&id_z_arzt_praxis=0&id_fachgruppe=441&id_zusatzbezeichnung=0&id_genehmigung=0&id_dmp=0&id_zusatzvertraege=0&id_sprache=0&vorname=&nachname=ohne Titel (Dr.)&arztgruppe=facharzt&geschlecht=alle&wochentag=alle&zeiten=alle&fa_name=&plz=761&ort=&strasse=&schluesselnr=&schluesseltyp=lanr7&landkreis=&id_leistungsort_art=0&id_praxis_zusatz=0&sorting=name&direction=ASC&checkbox_content=&name_schnellsuche=&fachgebiet_schnellsuche=']
    offset = 20
    def parse(self, response):
        doctor_cards = response.xpath('//ul[contains(@class, "resultlist")]/li[contains(@class, "resultrow")]')
        for d in doctor_cards:
            full_name = ' '.join(d.xpath('.//dd[@]/dl/dt/text()').extract())
            address = ', '.join(d.xpath('.//dd[@]/p[@]/text()').extract()[1:])
            urls = [x for x in d.xpath('.//dd[@]/p[@]/following-sibling::dl//a/@href').extract() if 'mailto:' not in x ]
            resp_meta = {
                'full_name': full_name,
                'address': address,
                'urls': urls 
            }
            if not urls:
                urls = ['https://google.com']
            for url in urls:
                print(url)
                yield response.follow(url=url, callback=self.parse_doctor_clinik, meta=resp_meta)

        next_page = 'https://www.arztsuche-bw.de/index.php?suchen=1&offset=' + str(self.offset) + '&id_z_arzt_praxis=0&id_fachgruppe=441&id_zusatzbezeichnung=0&id_genehmigung=0&id_dmp=0&id_zusatzvertraege=0&id_sprache=0&vorname=&nachname=ohne Titel (Dr.)&arztgruppe=facharzt&geschlecht=alle&wochentag=alle&zeiten=alle&fa_name=&plz=761&ort=&strasse=&schluesselnr=&schluesseltyp=lanr7&landkreis=&id_leistungsort_art=0&id_praxis_zusatz=0&sorting=name&direction=ASC&checkbox_content=&name_schnellsuche=&fachgebiet_schnellsuche='
        print(next_page)
        if self.offset < 80:
            self.offset += 20
            yield response.follow(next_page, callback = self.parse)
    
    def parse_doctor_clinik(self, response):
        items  = GermanMedicalItem()
        try:
            website_header = response.xpath('//h1/text()').get() if response.xpath('//h1/text()') else None
            logo_url = response.xpath('//img/@src').get() if response.xpath('//img/@src') else None
        except Exception as e:
            website_header = 'Not specified'
            logo_url = 'Not specified'
        items['full_name'] = response.request.meta['full_name']
        items['address'] = response.request.meta['address']
        items['office_urls'] = response.request.meta['urls']
        items['website_header'] = website_header
        items['logo_url'] = logo_url

        yield items

Your items.py file should look like:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class GermanMedicalItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    full_name = scrapy.Field()
    office_urls = scrapy.Field()
    address = scrapy.Field()
    website_header = scrapy.Field()
    logo_url = scrapy.Field()

Run with scrapy crawl doctors -o doctors_germ.json, and you get a JSON file like:

[
{"full_name": "Dr. med. Jan Gestrich Sprechstundenzeiten ", "address": "Zeppelinstr. 2, 76185 Karlsruhe, Ortsteil: Gr\u00fcnwinkel, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.ka-nephrologie.de"], "website_header": "Diagnostik und Therapie in unserer Nephrologischen Praxis", "logo_url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAC0lEQVQYV2NgAAIAAAUAAarVyFEAAAAASUVORK5CYII="},
{"full_name": "Dr. med. Martin Andre Sprechstundenzeiten ", "address": "S\u00fcdendstr. 47-49, 76137 Karlsruhe, Ortsteil: S\u00fcdweststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.nephrologie-karlsruhe.de"], "website_header": null, "logo_url": "https://static.wixstatic.com/media/689a07_b6517c8c92574851a08a4b37c9a23142~mv2.jpg/v1/fill/w_101,h_72,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/Logo_Nephro_neu.jpg"},
{"full_name": "Dr. med. Kathrin Drognitz Sprechstundenzeiten ", "address": "Moltkestr. 90, 76133 Karlsruhe, Ortsteil: Nordstadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.klinikum-karlsruhe.de/einrichtungen/spezielle-medizinische-einrichtungen/"], "website_header": "Spezielle medizinische Einrichtungen", "logo_url": "data:image/svg+xml;charset=utf-8,"},
{"full_name": "Dr. med. Thorsten Dorn Sprechstundenzeiten ", "address": "Kriegsstr. 140, 76133 Karlsruhe, Ortsteil: Innenstadt-West, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.hormone-karlsruhe.de"], "website_header": null, "logo_url": "/templates/web_joomla_neu/images/spacer.gif"},
{"full_name": "Dr. med. Wilhelm Hausch Sprechstundenzeiten ", "address": "Lammstr. 21, 76133 Karlsruhe, Ortsteil: Innenstadt-West, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.gastroenterologie-karlsruhe.de"], "website_header": "Herzlich Willkommen in der Praxis f\u00fcr Gastroenterologie am Ettlinger Tor.", "logo_url": "/assets/asset.babb34fd.png"},
{"full_name": "Dr. med. Norbert Bruhn Sprechstundenzeiten ", "address": "Gartenstr. 71, 76135 Karlsruhe, Ortsteil: S\u00fcdweststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.praxis-bruhn.com"], "website_header": null, "logo_url": "https://www.praxis-bruhn.com/s/img/emotionheader7307447.jpg?1472391703.667px.483px"},
{"full_name": "Dr. med. Kurt Beier Sprechstundenzeiten ", "address": "Ludwig-Erhard-Allee 24, 76131 Karlsruhe, Ortsteil: Innenstadt-Ost, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.deRossi.de", "https://www.medGAIN.de"], "website_header": "\r\n\t\t\t\t\r\n\t\t\t\t\tmedGAIN | Praxis Dr. med. Thomas de Rossi und Kollegen\r\n\t\t\t\t\r\n\t\t\t\t", "logo_url": "img/med_gain_logo.svg"},
{"full_name": "Dr. med. Kai Haberl Sprechstundenzeiten ", "address": "Waldstra\u00dfe 41-43, 76133 Karlsruhe, Ortsteil: Innenstadt-West, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.kardiologie-waldstrasse.de"], "website_header": " Unser Team hei\u00dft Sie herzlich willkommen! ", "logo_url": "images/logo_kardiologie_karlsruhe.svg"},
{"full_name": "Dr. med. Lutz Krieglstein Sprechstundenzeiten ", "address": "Hans-Sachs-Str. 1, 76133 Karlsruhe, Ortsteil: Weststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.praxis-muehlburger-tor.de"], "website_header": "Gastroenterologische Gemeinschaftspraxis in Karlsruhe", "logo_url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAC0lEQVQYV2NgAAIAAAUAAarVyFEAAAAASUVORK5CYII="},
{"full_name": "Dr. med. Mirko Krivokuca Sprechstundenzeiten ", "address": "Kaiserallee 30, 76185 Karlsruhe, Ortsteil: Weststadt, Landkreis: Karlsruhe - Stadt", "office_urls": ["https://www.kardiologie-musikerviertel.de"], "website_header": "Fieber\n?\u00a0\u00a0\u00a0 Husten?\u00a0\u00a0\u00a0 Atemwegsinfekt?", "logo_url": "https://image.jimcdn.com/app/cms/image/transf/none/path/sb3d393a4e68b5222/image/i855f937e8779839c/version/1608138272/image.jpg"},
....
    ]