I have been trying to scrape this page for editorial data with Scrapy.
In the Editorial Board Members section there are 54 editors inside 54 div tags, but when
I try to scrape them I get data from only 10 div tags:
len(response.css("#moreGeneralEditors>div"))
10
Here is the spider I am using to get the data:
import scrapy


class MdpjournalSpider(scrapy.Spider):
    name = 'try'
    start_urls = ["https://www.mdpi.com/journal/agrochemicals/editors"]

    def parse(self, response):
        outer_divs = response.css("div.middle-column__main.ul-spaced div.content__container>div")
        for inner_divs in outer_divs:
            if inner_divs.css("#moreGeneralEditors") != []:
                divs = inner_divs.css("#moreGeneralEditors>div")
                for inner_div in divs:
                    if inner_div.css("div.editor-div__content.img-exists") != []:
                        editor = inner_div.css("div.editor-div__content.img-exists:nth-child(2) b::text").get()
                        role = "editor"
                        yield {"editor": editor, "role": role}
                    elif inner_div.css("div.editor-div__content") != []:
                        editor = inner_div.css("div.editor-div__content:nth-child(1) b::text").get()
                        role = "editor"
                        yield {"editor": editor, "role": role}
Editors with an image and editors without one use different classes. I am only concerned with the Editorial Board Members section. The editor data for every journal has this problem. Here is the link to the list of all journals.
CodePudding user response:
You are getting only 10 items because the remaining 44 are loaded dynamically from an external source through an AJAX API call, so they are not in the initial HTML. You have to request the API URL instead.
Example:
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # AJAX endpoint that serves the editors missing from the initial page.
        api_url = 'https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3'
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
            # Marks the request as an XMLHttpRequest, as a browser would.
            'x-requested-with': 'XMLHttpRequest'
        }
        yield scrapy.Request(url=api_url, method='GET', callback=self.parse, headers=headers)

    def parse(self, response):
        # The attribute test of this XPath was lost in the source ('//*[@][1]/b').
        # The class name used here is an assumption taken from the question's selectors.
        members = response.xpath('//*[contains(@class, "editor-div__content")][1]/b')
        for member in members:
            yield {
                "editor": member.xpath('.//text()').get()
            }
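If you need the same request for other journals, such as the agrochemicals page from the question, the URL can be built from the journal slug. This is only a minimal sketch: it assumes every journal exposes the same /editors/ajax path and that term=10 and board=3 mean the same thing there, which I have not verified. build_editors_ajax_url and clean_name are hypothetical helper names, not part of Scrapy. clean_name just trims the extra whitespace that the scraped names carry (for example ' Dr. Pasquale Comberiati').

```python
from urllib.parse import urlencode

# Base path taken from the URL used in the spider above.
BASE = "https://www.mdpi.com/journal"


def build_editors_ajax_url(journal_slug, term=10, board=3):
    """Return the editors AJAX URL for one journal slug.

    Assumption: other journals use the same path and query parameters
    as the 'allergies' URL; the defaults are copied from that URL.
    """
    query = urlencode({"term": term, "board": board})
    return f"{BASE}/{journal_slug}/editors/ajax?{query}"


def clean_name(raw):
    """Trim the ends of a scraped name and collapse runs of whitespace."""
    return " ".join(raw.split())
```

In start_requests you could then pass build_editors_ajax_url('agrochemicals') as the url of the scrapy.Request, and wrap the extracted text in clean_name before yielding it.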
Output of the TestSpider run:
{'editor': ' Dr. Pasquale Comberiati'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Audrey DunnGalvin'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Monica Greco'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Inkyu Hwang'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Inaki Izquierdo'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Gisèle Kanny'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Chang Kim'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Rosario Linacero'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Soheila J. Maleki'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Giuseppe Murdaca'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Kazuyuki Nakagome'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Eleonora Nucera'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Franziska Roth-Walter'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Youn Young Shim'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Carina Gabriela Uasuf'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Joana Costa'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Magdalena Czarnecka-Operacz'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Danilo Di Bona'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Araceli Díaz -Perales'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Maria Gasset'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Elena Gimenez-Arnau'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Houman Goudarzi'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Lars Hellman'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Christiane Hilger'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Russell Hopp'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Mats W. Johansson'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Marat V. Khodoun'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Uday Kishore'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Rebecca Knibb'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Heung-Man Lee'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Isabel Mafra'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Mario Malerba'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Arduino A. Mangoni'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Nobuaki Miyahara'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Linda Monaci'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Tatsuya Moriyama'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Maria Pino-Yanes'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Daniel P. Potaczek'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Antonietta Rossi'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Ann-Marie Malby Schoos'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Gregory Seumois'}
2022-08-23 00:46:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Dr. Cenk Suphioglu'}
2022-08-23 00:46:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Junji Yodoi'}
2022-08-23 00:46:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.mdpi.com/journal/allergies/editors/ajax?term=10&board=3>
{'editor': ' Prof. Dr. Gianvincenzo Zuccotti'}
2022-08-23 00:46:22 [scrapy.core.engine] INFO: Closing spider (finished)
2022-08-23 00:46:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 369,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 11876,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.539094,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 8, 22, 18, 46, 22, 26301),
 'httpcompression/response_bytes': 59114,
 'httpcompression/response_count': 1,
 'item_scraped_count': 44,