How to scrape with Scrapy using the 'role' attribute?

This is the code I want to scrape. I'm interested in the text of the name class and the info class. I couldn't figure out how to scrape by the 'role' attribute. Any ideas?

main.py

import scrapy
from ..items import UniversityItem



class UniversityLecturersSpider(scrapy.Spider):
    name = 'university_lecturers'
    allowed_domains = ['www.runi.ac.il']
    start_urls = ['https://www.runi.ac.il/en/about/management/']

    def parse(self,response):

        items=UniversityItem()
        lecturers=response.xpath('//div[@role="rowgroup"]/li/text()').extract()


        for lecturer in lecturers:

                name=lecturer.css('div.name::text').extract_first()
                job=lecturer.xpath('//div[@class="info"]/p/text()').extract_first()
       
                items['name']=name
                items['job']=job
                yield items

My items.py:

import scrapy


class UniversityItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    job = scrapy.Field()

CodePudding user response：

Your issue is that you are trying to use xpath and css expressions on strings.

Once you have called one of the methods get, getall, extract, or extract_first, the return value is no longer a selector and can no longer be used for chaining XPath queries. Additionally, the 'rowgroup' role is on a <li> element, not a <div>, and even if lecturers were an appropriate selector for chaining, the job selector does not use a relative XPath expression, so it would search the whole document rather than the current element.

You will also want to create a new item instance for each lecturer instead of recycling the same one over and over, because reusing it will likely overwrite previously yielded items.
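To see why reusing one item is risky, here is a minimal sketch that uses plain dicts as stand-ins for UniversityItem (an assumption, so that it runs without Scrapy; the function names are hypothetical). It only illustrates the shared-object behaviour, not Scrapy's own pipeline:

```python
def reuse_one_item(names):
    """Yield the same dict object on every iteration (the risky pattern)."""
    item = {}  # created once, outside the loop
    for name in names:
        item["name"] = name  # mutates the object that was already yielded
        yield item


def fresh_item_each_time(names):
    """Yield a new dict object on every iteration (the safer pattern)."""
    for name in names:
        item = {"name": name}  # a separate object per lecturer
        yield item


names = ["Jane Doe", "John Roe"]  # made-up sample data
shared = list(reuse_one_item(names))
fresh = list(fresh_item_each_time(names))

print(shared)  # both entries are the same object, holding the last name
print(fresh)   # each entry keeps its own name
```

Anything that holds on to a yielded item after the loop has moved on sees the later values, which is why a new instance per iteration is the safer habit.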

What you actually want to do is closer to this:

for elem in response.xpath("//li[@role='rowgroup']"):
    name = elem.xpath("./div[@class='name']/text()").get()
    job = elem.xpath("./div[@class='info']/p/text()").get()
    item = UniversityItem()
    item['name'] = name
    item['job'] = job
    yield item

CodePudding user response：

If you look more closely at XPath, you will find that the @ sign is not only used to access the class; it can access any attribute of a tag.

You can loop over the list using:

lecturers = response.xpath('//li[@role="rowgroup"]')
for lecturer in lecturers:
    name = lecturer.css('div.name::text').extract_first()
    job = lecturer.css('div.info > p::text').extract_first()
    # your code