I am trying to scrape a code chunk from this documentation page, which hosts code in a peculiar way: the chunk is split across many spans, with the function call and each argument in their own span, and even the parentheses and commas get spans of their own. I am at a loss trying to extract the code snippet under 'Usage' with a Scrapy spider. Here's the code for my spider, which also scrapes the documentation text.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = response.css('pre::text').getall()
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
When trying to extract the text of the block using response.css('pre::text').getall(), somehow only the punctuation is returned, not the entire function call. The result also includes the example at the bottom of the page, which I'd rather avoid but don't know how to exclude.
Is there a better way to do this? I thought ::text would be perfect for this use case.
CodePudding user response:
Try iterating through the pre elements and extracting the text from each of them individually. The reason pre::text only returns the punctuation is that it selects only the text nodes that are direct children of the <pre>, while the function name and arguments sit inside nested <span> elements; selecting each <pre> first and then applying ::text to it collects every descendant text node, which you can then join back together.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = []
        for pre in response.css('pre'):
            # '::text' on the <pre> selector grabs every descendant text node,
            # including the pieces inside the highlighting <span>s; join them
            # back into a single string per block.
            code.append("".join(pre.css("::text").getall()))
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
OUTPUT:
2023-01-23 16:35:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html> (referer: None) ['cached']
2023-01-23 16:35:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html>
{'docu': 'Read in csv for year and plot all the accidents in state\non a map.', 'code': [['1'], ['fars_map_state', '(', 'state.num', ',', ' ', 'year', ')', '\n'], ['1'], ['fars_map_state', '(', '1', ',', ' ', '2013', ')', '\n']]}
2023-01-23 16:35:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-23 16:35:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 336,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 29477,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.108742,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 24, 0, 35, 16, 722131),
'httpcache/hit': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
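If you also want to skip the example at the bottom of the page, one option is to filter the joined blocks instead of yielding all of them. This is only a sketch based on the output above, where the digit-only blocks are line-number gutters and the Usage snippet is the first real code block; the filtering logic below is my own addition, not part of the original spider.

import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        # Join the text nodes of every <pre> into one string per block.
        blocks = ["".join(pre.css("::text").getall())
                  for pre in response.css('pre')]
        # Drop the line-number gutters (blocks that are just digits) and keep
        # the first remaining block, which is the Usage snippet on this page.
        code_blocks = [b for b in blocks if not b.strip().isdigit()]
        usage = code_blocks[0].strip() if code_blocks else ""
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': usage
        }

If rdrr.io ever reorders the blocks, a more robust alternative would be to anchor on the "Usage" heading with an XPath and take the <pre> that follows it.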