Webscraping pdfs in Python in multiple links

Time:07-11

I am trying to webscrape this website. To do so, I wrote the following code which works nicely:

from bs4 import BeautifulSoup
import pandas as pd
import requests

payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url= 'https://www.bis.org/doclist/cbspeeches.htm'
headers= {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
    }

req=requests.post(url,headers=headers,data=payload)
soup = BeautifulSoup(req.content, "lxml")
data=[]
for card in soup.select('.documentList tbody tr'):
    r = BeautifulSoup(requests.get(f"https://www.bis.org{card.a.get('href')}").content, "lxml")
    data.append({
        'date': card.select_one('.item_date').get_text(strip=True),
        'title': card.select_one('.title a').get_text(strip=True),
        'author': card.select_one('.authorlnk.dashed').get_text(strip=True),
        'url': f"https://www.bis.org{card.a.get('href')}",
        'text': r.select_one('#cmsContent').get_text('\n\n', strip=True)
        })

pd.DataFrame(data)

However, if you open, for example, the first link on the page, it contains a PDF. Whenever a linked page contains a PDF, I would like to add the content of that PDF to my dataframe.

I looked around and tried the following on just the PDF from the first link:

import io
from PyPDF2 import PdfFileReader


def info(pdf_path):
    response = requests.get(pdf_path)
     
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
 
    txt = f"""
    Information about {pdf_path}:
 
    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """
    print(txt)
    return information
 
info('https://www.bis.org/review/r220708e.pdf')
  

However, this only returns the metadata (which I already have from the previous code); the text itself is missing. Ideally, the PDF extraction would be part of the same code as above. I am stuck here.

Can anyone help me with this?

Thanks!

CodePudding user response:

You need to return it, e.g. as a tuple:

return txt, information

If you want the text inside the pdf:

text = ""
for page in pdf.pages:
    text += page.extract_text() + "\n"
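
To connect this to the io.BytesIO approach from the question, a minimal sketch could look like the following. The helper names join_page_texts and pdf_text_from_url are hypothetical, and the sketch assumes a PyPDF2 release that provides PdfReader and extract_text(). The third-party imports sit inside the download helper so the joining logic can be tried on its own.

```python
import io


def join_page_texts(pages):
    """Concatenate the text of page-like objects, one page per line block.

    Each item only needs an extract_text() method, as PyPDF2 pages have.
    """
    parts = []
    for page in pages:
        # Image-only (scanned) pages may yield no text at all
        text = page.extract_text() or ""
        parts.append(text.strip())
    # Skip empty pages so the result has no stray blank lines
    return "\n".join(part for part in parts if part)


def pdf_text_from_url(pdf_url):
    """Download a PDF into memory and return its text (hypothetical helper)."""
    # Imported here so join_page_texts stays usable without these packages
    import requests
    from PyPDF2 import PdfReader

    response = requests.get(pdf_url, timeout=60)
    response.raise_for_status()
    with io.BytesIO(response.content) as buffer:
        # Extraction happens while the buffer is still open
        return join_page_texts(PdfReader(buffer).pages)
```

Calling pdf_text_from_url('https://www.bis.org/review/r220708e.pdf') should then give the text of the speech rather than only its metadata, provided the PDF contains a text layer.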

CodePudding user response:

I'll allow you the pleasure of adapting this to your requests, sync scraping fashion (really not hard):

from httpx import AsyncClient
from PyPDF2 import PdfReader
...
async def get_full_content(url):
    # headers is the dict defined in the question's code
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        if url[-3:] == 'pdf':
            r = await client.get(url)
            filename = url.split("/")[-1]
            with open(filename, 'wb') as f:
                f.write(r.content)
            # Read the file only after it is closed, so every byte is on disk
            reader = PdfReader(filename)
            pdf_text = ''
            for page in reader.pages:
                pdf_text = pdf_text + page.extract_text()
            return pdf_text

And then you do something with the pdf_text extracted from .pdf (saving it into a db, reading it with pandas, nlp-ing it with Transformers/torch, etc).
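
For the sync, requests-based adaptation, one possible way to get the text into the dataframe is sketched below. The names first_pdf_link and add_pdf_text are hypothetical, fetch_text stands for whatever function downloads a PDF and returns its text, and the 'pdf_url' key is an assumption of this sketch that the question's loop does not produce yet.

```python
from urllib.parse import urljoin, urlparse


def first_pdf_link(hrefs, base_url="https://www.bis.org"):
    """Return the first href pointing to a .pdf as an absolute URL, else None."""
    for href in hrefs:
        # Compare the path only, so query strings or fragments do not interfere
        if href and urlparse(href).path.lower().endswith(".pdf"):
            return urljoin(base_url, href)
    return None


def add_pdf_text(rows, fetch_text):
    """Add a 'pdf_text' key to each row dict, using fetch_text(url) for PDFs.

    Rows are expected to carry a 'pdf_url' key (an assumption of this sketch);
    rows without a PDF get None so the DataFrame column stays aligned.
    """
    for row in rows:
        pdf_url = row.get("pdf_url")
        row["pdf_text"] = fetch_text(pdf_url) if pdf_url else None
    return rows
```

Inside the question's loop, the hrefs could be collected with [a.get('href') for a in r.select('a[href]')] and the result of first_pdf_link stored under 'pdf_url'. After the loop, pd.DataFrame(add_pdf_text(data, fetch_text)) would give one extra column with the PDF text.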


Edit: one more thing: do a pip install -U pypdf2 as the package was recently updated (a few hours ago), just to make sure you're up to date.
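
To see which version is actually installed before or after upgrading, a small check like the one below may help. The helper names installed_version and major_version are hypothetical. As far as I know, newer PyPDF2 releases favour the PdfReader name used in this answer over the PdfFileReader name used in the question, so the version is worth knowing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="PyPDF2"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Extract the leading integer of a version string such as '3.0.1'."""
    return int(version_string.split(".")[0])
```

If installed_version() returns None, the package is not installed in the current environment at all, which is a common cause of import errors after running pip against a different interpreter.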
