How to scrape all URLs using Scrapy?

Time:08-26

I tried to get the URLs of the articles in the search results in these ways:

selector = response.xpath("//*[contains(@class, 'bw-news-list')]/a/@href").extract()
selector = response.xpath("//*[contains(@class, 'bw-search-results')]/a/@href").extract()
selector = response.css('ul.bw-news-list a::attr(href)')

but I'm not able to get any.

This is the Site URL: screenshot of elements

CodePudding user response:

Your selectors return an empty result because the search results are not in the initial HTML: the page loads them dynamically from an external source (an API) through an AJAX request. Request the API URL directly instead.

Working solution as an example:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # Headers that mark the request as a browser AJAX (XHR) call.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest'
        }

        # API endpoint that returns the search results (searchTerm=amputee, resultsPage=1).
        url = 'https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken'
        yield scrapy.Request(
            url=url,
            headers=headers,
            callback=self.parse,
            method="GET")

    def parse(self, response):
        # Take the article links from the first results list in the response.
        # Assumes the list has the class 'bw-news-list', as in the question's selectors.
        for link in response.xpath('(//*[@class="bw-news-list"])[1]/li/h3/a/@href'):
            yield {
                'URL': link.get()
            }

Output:

{'URL': 'http://www.businesswire.com/news/home/20220810005639/en/The-Amputee-Coalition’s-National-Conference-Welcomes-over-650-Attendees-in-the-Limb-Loss-and-Limb-Difference-Community'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220809005759/en/Napoli-Shkolnik-PLLC-files-50-million-lawsuit-against-Suffolk-County'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220727006087/en/Dialyze-Direct-Partners-with-CommuniCare-Health-to-Provide-On-Site-Dialysis'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220602005023/en/Indian-Motorcycle-Partners-With-Veterans-Charity-Ride-for-8th-Annual-Motorcycle-Therapy-Program'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220421005541/en/GreenRoom-Advisory-Lee-Puts-His-Only-Foot-Forward-–-Again!'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220315005628/en/CORRECTING-and-REPLACING-NextGen-Healthcare-Teams-Up-with-Three-time-Paralympic-Medalist-Amy-Purdy-to-Champion-Whole-Person-Care'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220314005652/en/Global-3D-Printing-Markets-2022---2027-by-Printer-Type-Materials-Software-Applications-Services-and-Solutions-in-Industry-Verticals---ResearchAndMarkets.com'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220130005017/zh-CN/'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220130005018/zh-HK/'}
2022-08-25 14:36:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.businesswire.com/portal/site/home/template.BINARYPORTLET/search/resource.process/?javax.portlet.tpst=92055fbcbec7e639f1f554100d908a0c&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchTerm=amputee&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_resultsPage=1&javax.portlet.rst_92055fbcbec7e639f1f554100d908a0c_searchType=news&javax.portlet.rid_92055fbcbec7e639f1f554100d908a0c=searchPaging&javax.portlet.rcl_92055fbcbec7e639f1f554100d908a0c=cacheLevelPage&javax.portlet.begCacheTok=com.vignette.cachetoken&javax.portlet.endCacheTok=com.vignette.cachetoken>
{'URL': 'http://www.businesswire.com/news/home/20220202005449/en/2021-Pain-Management-Devices-Pipeline-Product-Landscape-Report---ResearchAndMarkets.com'}
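
The spider above requests only resultsPage=1. To get the URLs from further pages, one option is to rebuild the API URL from its query parameters and change the page number. This is a minimal sketch, not a tested crawl: build_search_url is a hypothetical helper, the parameter names are copied from the URL above, and it assumes (unverified) that the API returns the next page of results when resultsPage is incremented.

```python
from urllib.parse import urlencode

# Portlet id and endpoint copied from the API URL in the answer above.
PORTLET_ID = "92055fbcbec7e639f1f554100d908a0c"
BASE_URL = (
    "https://www.businesswire.com/portal/site/home/"
    "template.BINARYPORTLET/search/resource.process/"
)


def build_search_url(search_term, page):
    """Hypothetical helper: rebuild the search API URL for a term and page."""
    rst = f"javax.portlet.rst_{PORTLET_ID}_"
    # Insertion order is kept, so the query string matches the original layout.
    params = {
        "javax.portlet.tpst": PORTLET_ID,
        rst + "searchTerm": search_term,
        rst + "resultsPage": page,
        rst + "searchType": "news",
        f"javax.portlet.rid_{PORTLET_ID}": "searchPaging",
        f"javax.portlet.rcl_{PORTLET_ID}": "cacheLevelPage",
        "javax.portlet.begCacheTok": "com.vignette.cachetoken",
        "javax.portlet.endCacheTok": "com.vignette.cachetoken",
    }
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Print the URLs for the first three result pages (the page count is arbitrary).
    for page in range(1, 4):
        print(build_search_url("amputee", page))
```

In start_requests you could then loop over the page numbers and yield one scrapy.Request per URL, with the same headers and callback as above. How many pages exist for a search term is not known in advance here, so you would need your own stop condition, for example stopping when a page yields no links.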