Selecting text where each char is stored in separate span

Time:01-25

I am trying to scrape a code chunk from a documentation page that stores code in a peculiar way: the chunk is split so that the function call and each argument have their own spans, and even the parentheses and commas get their own span. I am at a loss trying to extract the snippet under 'Usage' with a Scrapy spider. Here is the code for my spider, which also scrapes the documentation text.

import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url="https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = response.css('pre::text').getall()
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }

When I try to extract the text of the block with response.css('pre::text').getall(), somehow only the punctuation is returned, not the entire function call. The result also includes the example at the bottom of the page, which I'd rather avoid but don't know how to exclude. Is there a better way to do this? I thought ::text would be perfect for this use case.

CodePudding user response:

Try iterating over the pre elements and joining the text nodes of each one individually. The reason pre::text returned only punctuation is that it selects only text nodes that are direct children of the <pre>; the identifiers sit inside nested <span>s, so they are skipped. Joining every descendant text node of each <pre> recovers the full line.

import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url="https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = []
        for pre in response.css('pre'):
            code.append("".join(pre.css("::text").getall()))
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }

OUTPUT:

2023-01-23 16:35:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html> (referer: None) ['cached']
2023-01-23 16:35:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html>
{'docu': 'Read in csv for  year and plot all the accidents in state\non a map.', 'code': ['1', 'fars_map_state(state.num, year)\n', '1', 'fars_map_state(1, 2013)\n']}
2023-01-23 16:35:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-23 16:35:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 336,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 29477,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.108742,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 1, 24, 0, 35, 16, 722131),
 'httpcache/hit': 1,
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
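On the asker's other point, skipping the example block at the bottom: the same join-the-descendant-text-nodes idea, plus remembering which heading each <pre> follows, can be sketched with only the standard library. The markup below is a hypothetical miniature of the rdrr.io layout (the heading tags and section names are assumptions, not the real page), so treat this as an illustration of the technique rather than a drop-in scraper:

```python
from html.parser import HTMLParser

class SectionPreExtractor(HTMLParser):
    """Record (section, text) for every <pre>, where `section` is the
    text of the most recent <h2> heading seen before that block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # (section, joined text) per <pre>
        self._section = None    # text of the last <h2> seen
        self._in_h2 = False
        self._in_pre = False
        self._parts = []        # text nodes of the current <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._section = ""
        elif tag == "pre":
            self._in_pre = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif tag == "pre" and self._in_pre:
            self._in_pre = False
            self.blocks.append((self._section, "".join(self._parts)))

    def handle_data(self, data):
        if self._in_h2:
            self._section += data
        elif self._in_pre:
            self._parts.append(data)

# Hypothetical miniature of the page: every token in its own <span>,
# punctuation and whitespace included, just like the real code chunk.
html = (
    '<h2>Usage</h2>'
    '<pre><span>fars_map_state</span><span>(</span><span>state.num</span>'
    '<span>,</span> <span>year</span><span>)</span></pre>'
    '<h2>Examples</h2>'
    '<pre><span>fars_map_state</span><span>(</span><span>1</span>'
    '<span>,</span> <span>2013</span><span>)</span></pre>'
)
parser = SectionPreExtractor()
parser.feed(html)
usage_only = [text for section, text in parser.blocks if section == "Usage"]
print(usage_only)  # only the 'Usage' block, Examples filtered out
```

In the spider itself the equivalent filter would be a selector that picks only the <pre> following the Usage heading (e.g. an XPath with following-sibling), but since the exact heading markup on the live page isn't shown here, the parser above just demonstrates the section-tracking idea.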