This is the code I am using to scrape the page. I'm interested in the text of the name class and the info class. I couldn't figure out how to select by the 'role' attribute. Any ideas?
main.py
import scrapy
from ..items import UniversityItem


class UniversityLecturersSpider(scrapy.Spider):
    name = 'university_lecturers'
    allowed_domains = ['www.runi.ac.il']
    start_urls = ['https://www.runi.ac.il/en/about/management/']

    def parse(self, response):
        items = UniversityItem()
        lecturers = response.xpath('//div[@role="rowgroup"]/li/text()').extract()
        for lecturer in lecturers:
            name = lecturer.css('div.name::text').extract_first()
            job = lecturer.xpath('//div[@]/p/text()').extract_first()
            items['name'] = name
            items['job'] = job
            yield items
My items.py:
import scrapy


class UniversityItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    job = scrapy.Field()
CodePudding user response:
Your issue is that you are trying to use XPath and CSS expressions on strings.
Once you have called one of the methods get, getall, extract or extract_first, the return value is a plain string (or a list of strings), not a selector, so it can no longer be used for chaining further queries. Additionally, the 'rowgroup' role is on an <li> element, not a <div>, and even if lecturers were an appropriate selector for chaining, the job selector is not a relative XPath expression: a path starting with // searches the whole document rather than the current element.
You will also want to create a new item instance for each lecturer instead of reusing the same one over and over, because reusing it will likely overwrite previously yielded items.
What you actually want to do is closer to this:
for elem in response.xpath("//li[@role='rowgroup']"):
    # Relative paths (./) are evaluated against the current <li> only
    name = elem.xpath("./div[@class='name']/text()").get()
    job = elem.xpath("./div[@class='info']/p/text()").get()
    # Create a fresh item for every lecturer
    item = UniversityItem()
    item['name'] = name
    item['job'] = job
    yield item
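To illustrate why a fresh item per iteration matters, here is a minimal sketch in plain Python. It uses ordinary dicts as a stand-in for UniversityItem, and the function names and sample names are hypothetical; it only shows the general behaviour of yielding one mutable object repeatedly, not anything specific to Scrapy's internals.

```python
def reuse_one_item(names):
    # One dict is created up front and mutated on every iteration.
    item = {}
    for name in names:
        item["name"] = name
        yield item


def fresh_item_each_time(names):
    # A new dict is created per iteration, so earlier results are untouched.
    for name in names:
        item = {}
        item["name"] = name
        yield item


names = ["Alice", "Bob"]  # hypothetical sample data
shared = list(reuse_one_item(names))
fresh = list(fresh_item_each_time(names))

print(shared)  # every entry is the same object, holding the last name written
print(fresh)   # each entry keeps its own value
```

Anything that holds on to the yielded objects, as the list() call does here, ends up seeing only the final values when a single instance is reused.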
CodePudding user response:
If you look at XPath in more detail, you will find that the @ sign is not only used to access the class; it is used to access any attribute of the tag.
You can loop over the list using:
lecturers = response.xpath('//li[@role="rowgroup"]')
for lecturer in lecturers:
    name = lecturer.css('div.name::text').extract_first()
    job = lecturer.css('div.info > p::text').extract_first()
    # your code
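The point about @ matching any attribute can be tried without running a spider. The sketch below uses the standard library's xml.etree.ElementTree, which supports a limited XPath subset, as a stand-in for Scrapy's selectors. The markup and the names in it are made up to resemble the structure discussed above; they are not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the structure discussed in this thread.
SAMPLE = """
<ul>
  <li role="rowgroup">
    <div class="name">Jane Doe</div>
    <div class="info"><p>President</p></div>
  </li>
  <li role="rowgroup">
    <div class="name">John Roe</div>
    <div class="info"><p>Provost</p></div>
  </li>
  <li role="other"><div class="name">Ignored</div></li>
</ul>
"""

root = ET.fromstring(SAMPLE)
rows = []
# [@role='rowgroup'] filters on the role attribute, just as
# [@class='name'] filters on the class attribute.
for li in root.findall(".//li[@role='rowgroup']"):
    # Paths starting with "./" are evaluated relative to the current <li>.
    name = li.find("./div[@class='name']").text
    job = li.find("./div[@class='info']/p").text
    rows.append({"name": name, "job": job})

print(rows)
```

The same predicates work in Scrapy's response.xpath(), with the difference that Scrapy returns selectors, so the text has to be pulled out with /text() and get().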