Is it possible to scrape HTML tables using Scrapy shell based on text criteria?

I am dipping my toes into web scraping, and I am trying to scrape the Queensland Lobbyist Register and the links within the main register using Scrapy. Each lobbyist has a link that can be followed to get their client list (for example, Antinomies and Australian Public Affairs); however, these nested tables are not consistent from page to page. For Antimonies, for example, the XPath for the clients is //*[@id="main"]/table[7] and the clients start at row 20, while for APF it is //*[@id="main"]/table[6] and they start at row 24. What the two pages have in common is that both client subtables come under this row:

"Client/s on whose behalf lobbying activity is, or may be, conducted"

Is there a way that Scrapy can be coded to read rows only after specific rows for each page?

I have been using the following:

tableclients = response.xpath('//*[@id="main"]/table[7]//tbody')
rowclients = tableclients.xpath('//tr')

CodePudding user response：

Yes, it's possible to scrape HTML tables with Scrapy based on text criteria, which here is most likely: Client/s on whose behalf lobbying activity is, or may be, conducted. Select the heading (for example an h2) by its text node using contains(), then use a sibling axis such as preceding-sibling to reach the table relative to it (table 7 in the example below), and extract the desired data from there.

An example with working code:

from scrapy.crawler import CrawlerProcess

import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        #'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        #'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
    
    def start_requests(self):
    
        yield scrapy.Request(
            url='https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list',
            callback=self.parse,
            #dont_filter=True
            )
    
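    def parse(self, response):
        # Follow each lobbyist's detail link from the third column of the register table.
        for href in response.xpath('//*[@id="table01546"]/tbody//tr/td[3]/a/@href').getall():
            yield scrapy.Request(
                url=response.urljoin(href),  # also copes with relative links
                callback=self.parse_client_data,
            )

    def parse_client_data(self, response):
        # Find the element whose text contains "Returns", then step back to the
        # seventh table among its preceding siblings.
        for tr in response.xpath('//*[contains(text(),"Returns")]/preceding-sibling::table[7]/tbody//tr'):
            # The first cell holds the label, the second cell holds the value.
            td1 = ''.join(tr.xpath('.//td[1]//text()').getall()).replace(':', '').strip().replace('\xa0', '')
            td2 = tr.xpath('.//td[2]//text()')
            td2 = ''.join(td2.getall()).strip().replace('\xa0', ' ') if td2 else None
            yield {td1: td2}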
        
             
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

{'Email Address': '[email protected]'}
2022-09-11 22:02:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/CMAX-Communications>
{'ACN/ ABN': '73 130 740 546'}
2022-09-11 22:02:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/CMAX-Communications>
{'Trading Name': 'CMAX Advisory'}
2022-09-11 22:02:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd> (referer: https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list)
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Company Name': 'Counsel House Pty Ltd'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'CompanyAddress': 'Level 14, 333 Collins Street, Melbourne VIC 3000'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Phone Number': '03 8639 5890'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Email Address': '[email protected]'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'ACN/ ABN': '35 631 919 009'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Trading Name': 'Counsel House Pty Ltd'}
2022-09-11 22:02:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists> (referer: https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list)
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Company Name': 'Australian Society of Ophthalmologists'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'CompanyAddress': '6/183 Wickham Terrace, Brisbane QLD 4000'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Phone Number': '07 383103006'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Email Address': '[email protected]'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'ACN/ ABN': '29 454 001 424'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Trading Name': 'Australian Society of Ophthalmologists'}

 'downloader/response_status_count/200': 52,
 
 'item_scraped_count': 255,

... and so on

CodePudding user response：

Try something like this:

//h3[contains(text(), 'Your text')]/following-sibling::div[1]/text()