I am trying to scrape this website. To do so, I wrote the following code, which works nicely:
from bs4 import BeautifulSoup
import pandas as pd
import requests

payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url = 'https://www.bis.org/doclist/cbspeeches.htm'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
}

req = requests.post(url, headers=headers, data=payload)
soup = BeautifulSoup(req.content, "lxml")

data = []
for card in soup.select('.documentList tbody tr'):
    r = BeautifulSoup(requests.get(f"https://www.bis.org{card.a.get('href')}").content)
    data.append({
        'date': card.select_one('.item_date').get_text(strip=True),
        'title': card.select_one('.title a').get_text(strip=True),
        'author': card.select_one('.authorlnk.dashed').get_text(strip=True),
        'url': f"https://www.bis.org{card.a.get('href')}",
        'text': r.select_one('#cmsContent').get_text('\n\n', strip=True)
    })

pd.DataFrame(data)
However, if you open, for example, the first link on the page, there is a PDF in it. Whenever a link contains a PDF, I would like to add the content of that PDF to my dataframe.
To do so, I have been looking around, and I tried the following on just the first PDF of the first link:
import io
from PyPDF2 import PdfFileReader

def info(pdf_path):
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

info('https://www.bis.org/review/r220708e.pdf')
However, it just gets the document info (which I already have from the previous code) and is missing the text. Ideally, I would like it to be part of the same code as above. I got stuck here.
Can anyone help me with this?
Thanks!
CodePudding user response:
You need to return it, e.g. as a tuple:
return txt, information
If you want the text inside the pdf:

text = ""
for page in pdf.pages:
    text += page.extract_text() + "\n"
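Putting the two suggestions together, a minimal sketch of the adapted function could look like this. It keeps the PdfFileReader/getDocumentInfo names from the question, which assumes PyPDF2 2.x, where those deprecated names still run (with warnings) and pages expose extract_text():

import io
import requests
from PyPDF2 import PdfFileReader

def info(pdf_path):
    # download the PDF and read it from memory
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        # walk the pages and collect their text
        text = ""
        for page in pdf.pages:
            text += page.extract_text() + "\n"
    # return the full text together with the metadata
    return text, information

pdf_text, pdf_info = info('https://www.bis.org/review/r220708e.pdf')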
CodePudding user response:
I'll leave you the pleasure of adapting this to your requests-based, synchronous scraping fashion (really not hard):

from PyPDF2 import PdfReader

...

async def get_full_content(url):
    # AsyncClient here is presumably httpx.AsyncClient (imported in the elided part above)
    async with AsyncClient(headers=headers, timeout=60.0, follow_redirects=True) as client:
        if url[-3:] == 'pdf':
            r = await client.get(url)
            with open(f'{url.split("/")[-1]}', 'wb') as f:
                f.write(r.content)
            reader = PdfReader(f'{url.split("/")[-1]}')
            pdf_text = ''
            number_of_pages = len(reader.pages)
            for x in range(number_of_pages):
                page = reader.pages[x]
                text = page.extract_text()
                pdf_text = pdf_text + text
And then you do something with the pdf_text extracted from the .pdf (saving it into a db, reading it with pandas, NLP-ing it with Transformers/torch, etc.).
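For reference, here is one minimal synchronous sketch of that adaptation, plugged into the loop from the question. The helper name get_pdf_text and the a[href$=".pdf"] selector are my own assumptions (the BIS detail pages may expose the PDF link differently), not part of this answer:

import io
import requests
from PyPDF2 import PdfReader

def get_pdf_text(pdf_url):
    # hypothetical helper: download a PDF and join the text of all its pages
    resp = requests.get(pdf_url)
    reader = PdfReader(io.BytesIO(resp.content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# inside the question's loop, after building the detail-page soup `r`:
pdf_link = r.select_one('a[href$=".pdf"]')   # assumes the detail page links the PDF directly
pdf_text = get_pdf_text(f"https://www.bis.org{pdf_link.get('href')}") if pdf_link else ''
# then add e.g. 'pdf_text': pdf_text to the dict that is appended to data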
Edit: one more thing: do a pip install -U pypdf2, as the package was recently updated (a few hours ago), just to make sure you're up to date.