How to use Scrapy to parse PDFs?

I would like to download all PDFs found on a site, e.g. https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html. I also tried using rules, but I don't think they are necessary here.

This is my approach:

import scrapy
from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor
from scrapy.spiders import Rule

CUSTOM_IGNORED_EXTENSIONS = IGNORED_EXTENSIONS.copy()
CUSTOM_IGNORED_EXTENSIONS.remove('pdf')

class PDFParser(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'

    # URL of the page that lists the PDF files
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    rules = (
        Rule(LinkExtractor(allow=r'.*\.pdf', deny_extensions=CUSTOM_IGNORED_EXTENSIONS), callback='parse', follow=True),
    )

    def parse(self, response):
        # selector of pdf file.
        for pdf in response.xpath("//a[contains(@href, 'pdf')]"):
            yield scrapy.Request(
                url=response.urljoin(pdf),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

It seems there are two problems. The first occurs when extracting all the PDF links with XPath:

TypeError: Cannot mix str and non-str arguments

The second problem is handling the PDF file itself: I just want to store it locally in a specific folder or similar. It would be really great if someone had a working example for this kind of site.

CodePudding user response:

To download files you need to use the FilesPipeline. This requires that you enable it in ITEM_PIPELINES and then provide a field named file_urls in your yielded item. In the example below, I have created an extension of the FilesPipeline in order to retain the filename of the PDF as provided on the website. The files will be saved in a folder named downloaded_files in the current directory.

Read more about the FilesPipeline in the Scrapy docs.
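
If you are working inside a regular Scrapy project, the same configuration can go into settings.py instead of the spider's custom_settings shown below. A minimal sketch, assuming the pipeline class lives in your project's pipelines.py module (myproject is a placeholder name):

# settings.py (project-level alternative to custom_settings)
ITEM_PIPELINES = {
    "myproject.pipelines.PdfPipeline": 100,
}
FILES_STORE = "downloaded_files"  # folder where the downloaded PDFs are stored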

import scrapy
from scrapy.pipelines.files import FilesPipeline

class PdfPipeline(FilesPipeline):
    # save files under the filename used on the website instead of the default hash;
    # the keyword-only item argument is required by newer Scrapy versions
    def file_path(self, request, response=None, info=None, *, item=None):
        file_name = request.url.split('/')[-1]
        return file_name

class StadtKoelnAmtsblattSpider(scrapy.Spider):
    name = 'stadt_koeln_amtsblatt'
    start_urls = ['https://www.stadt-koeln.de/politik-und-verwaltung/bekanntmachungen/amtsblatt/index.html']

    custom_settings = {
        "ITEM_PIPELINES": {
            PdfPipeline: 100
        },
        "FILES_STORE": "downloaded_files"
    }

    def parse(self, response):
        links = response.xpath("//a[@class='download pdf pdf']/@href").getall()
        links = [response.urljoin(link) for link in links] # to make them absolute urls

        yield {
            "file_urls": links
        }
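
To try this without creating a full Scrapy project, the spider can be run as a plain script with CrawlerProcess. A minimal sketch, assuming the code above is saved as stadt_koeln_spider.py and this snippet is appended to the same file:

from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(StadtKoelnAmtsblattSpider)
    process.start()  # blocks until the crawl has finished

When the crawl runs, the FilesPipeline downloads every URL listed in file_urls, stores the PDFs under downloaded_files, and adds a files field with the download results (local path, original URL, checksum) to the yielded item.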