I am dipping my toes into web scraping, and I am trying to scrape the Queensland Lobbyist Register and the links within the main register using Scrapy. Each lobbyist has a link that can be followed to get their client list (for example, Antinomies and Australian Public Affairs); however, these nested tables are not consistent from page to page.
For Antinomies, for example, the XPath for the clients is //*[@id="main"]/table[7], and the clients start at row 20; for Australian Public Affairs, it's //*[@id="main"]/table[6], and they start at row 24. The common thing is that both client subtables come under this row:
"Client/s on whose behalf lobbying activity is, or may be, conducted"
Is there a way that Scrapy can be coded to read only the rows after a specific row on each page?
I have been using the following:
tableclients = response.xpath('//*[@id="main"]/table[7]//tbody')
rowclients = tableclients.xpath('//tr')
CodePudding user response:
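Yes, it's possible to scrape HTML tables with Scrapy based on a text criterion, which here is most likely: Client/s on whose behalf lobbying activity is, or may be, conducted. Select the element (an h2 tag, for example) by its text node value using contains(), then step from it to the table you need with the preceding-sibling axis and grab the desired data from there. Note that preceding-sibling::table[7] counts backwards from the matched element, so it is the seventh table before it. The example below runs against the ACT register of lobbyists and anchors on the element whose text contains "Returns".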
An example with working code:
from scrapy.crawler import CrawlerProcess
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        # 'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        # 'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list',
            callback=self.parse,
            # dont_filter=True
        )

    def parse(self, response):
        for Lobbyist in response.xpath('//*[@id="table01546"]/tbody//tr/td[3]/a/@href'):
            link = Lobbyist.get()
            yield scrapy.Request(
                url=link,
                callback=self.parse_client_data,
            )

    def parse_client_data(self, response):
        for tr in response.xpath('//*[contains(text(),"Returns")]/preceding-sibling::table[7]/tbody//tr'):
            td1 = ''.join(tr.xpath('.//td[1]//text()').getall()).replace(':', '').strip().replace('\xa0', '')
            td2 = tr.xpath('.//td[2]//text()')
            td2 = ''.join(td2.getall()).strip().replace('\xa0', ' ') if td2 else None
            yield {td1: td2}


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
Output:
{'Email Address': '[email protected]'}
2022-09-11 22:02:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/CMAX-Communications>
{'ACN/ ABN': '73 130 740 546'}
2022-09-11 22:02:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/CMAX-Communications>
{'Trading Name': 'CMAX Advisory'}
2022-09-11 22:02:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd> (referer: https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list)
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Company Name': 'Counsel House Pty Ltd'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'CompanyAddress': 'Level 14, 333 Collins Street, Melbourne VIC 3000'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Phone Number': '03 8639 5890'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Email Address': '[email protected]'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'ACN/ ABN': '35 631 919 009'}
2022-09-11 22:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/counsel-house-pty-ltd>
{'Trading Name': 'Counsel House Pty Ltd'}
2022-09-11 22:02:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists> (referer: https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists-full-list)
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Company Name': 'Australian Society of Ophthalmologists'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'CompanyAddress': '6/183 Wickham Terrace, Brisbane QLD 4000'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Phone Number': '07 383103006'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Email Address': '[email protected]'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'ACN/ ABN': '29 454 001 424'}
2022-09-11 22:02:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.parliament.act.gov.au/function/tru/act-register-of-lobbyists/act-register-of-lobbyists4/Australian-Society-of-Ophthalmologists>
{'Trading Name': 'Australian Society of Ophthalmologists'}
'downloader/response_status_count/200': 52,
'item_scraped_count': 255,
... so on
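If the marker row sits in the same table as the client rows, another option is to pull every row and drop everything up to the marker in plain Python, so the table index and row number no longer matter. A minimal sketch, assuming each row has already been extracted as a list of cell strings; rows_after_marker and the sample rows are made up for illustration and are not part of the spider above:

```python
from itertools import dropwhile

MARKER = "Client/s on whose behalf lobbying activity is, or may be, conducted"


def rows_after_marker(rows, marker=MARKER):
    """Return the rows that come after the first row containing `marker`.

    `rows` is a list of rows, each row being a list of cell strings.
    An empty list is returned when the marker never appears.
    """
    def is_before_marker(row):
        return not any(marker in cell for cell in row)

    remaining = dropwhile(is_before_marker, rows)
    next(remaining, None)  # discard the marker row itself
    return list(remaining)


# Hypothetical rows, e.g. built in a callback with something like
# [tr.xpath('.//td//text()').getall() for tr in response.xpath('//table//tr')]
sample_rows = [
    ["Company Name", "Example Pty Ltd"],
    ["Phone Number", "00 0000 0000"],
    [MARKER],
    ["Client A"],
    ["Client B"],
]

print(rows_after_marker(sample_rows))
```

This keeps the XPath simple and moves the "start after this row" rule into ordinary Python, which can be easier to debug when pages differ.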
CodePudding user response:
Try something like this:
//h3[contains(text(), 'Your text')]/following-sibling::div[1]/text()