I want to read two PDF files from URLs without downloading them to disk, and then extract the text using pdftotext:
import pdftotext
with open("pdf_path1", "rb") as f:
    pdf = pdftotext.PDF(f)
# If it's password-protected
with open("b.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")
# How many pages?
print(len(pdf))
# Iterate over all the pages
for page in pdf:
    print(page)
# Read some individual pages
print(pdf[0])
print(pdf[1])
# Read all the text into one string
print("\n\n".join(pdf))
How can I resolve this error? Or is there any other technique available to read a PDF from a URL?
CodePudding user response:
You can open the file directly from the URL with urllib.request and then work on it as a PDF:
import pdftotext
from urllib.request import urlopen
target_url = "https://arxiv.org/pdf/2012.05439.pdf" # to change.
file = urlopen(target_url)
pdf = pdftotext.PDF(file) # add password if password protected.
# How many pages?
print(len(pdf))
# Iterate over all the pages
for page in pdf:
    print(page)
# Read some individual pages
print(pdf[0])
print(pdf[1])
# Read all the text into one string
print("\n\n".join(pdf))
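If you need this for two (or more) URLs, one possible sketch is to read each response fully into memory and hand pdftotext a seekable io.BytesIO buffer, so nothing is written to disk. The helper names fetch_pdf and pdf_text_from_url, the 30-second timeout and the %PDF header check are my own assumptions for illustration; they are not part of pdftotext:

```python
import io
from urllib.request import urlopen


def fetch_pdf(url, timeout=30):
    """Read the resource at `url` into memory and return a seekable binary stream.

    Hypothetical helper: the bytes are kept in RAM, no file is created on disk.
    """
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    # A PDF normally starts with "%PDF"; this catches HTML error pages early.
    if b"%PDF" not in data[:1024]:
        raise ValueError(f"{url} does not look like a PDF")
    return io.BytesIO(data)


def pdf_text_from_url(url, password=""):
    """Return all pages of the PDF at `url` joined into one string."""
    # Imported here so fetch_pdf can be used even if pdftotext is not installed.
    import pdftotext

    pdf = pdftotext.PDF(fetch_pdf(url), password)
    return "\n\n".join(pdf)
```

Calling pdf_text_from_url once per URL and collecting the results in a list should cover the two-file case. Note that the bytes are still transferred over the network; they are just held in memory instead of in a file.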
CodePudding user response:
You cannot read somebody's PDF online; it must be your own copy (ALL PDFS MUST BE DOWNLOADED). Your computer can only work with local HTML pages and their contents. That's the way it was, and still is.
How the web works in just one line (more graphic methods are available):
<A HyperRef=HTextTransferProtocol://www.website.html>download to view our BBS pages</a>
curl -o temp.pdf https://arxiv.org/pdf/2012.05439.pdf && pdftotext -layout -f 1 -l 1 temp.pdf -
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1318k  100 1318k    0     0   488k      0  0:00:02  0:00:02 --:--:--  488k
Scheduling Beyond CPUs for HPC....
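The same two steps can be scripted from Python if you prefer. This is only a sketch: it assumes the pdftotext command-line tool is on your PATH, and the function names pdftotext_command and text_via_cli are made up for this example. The PDF is saved in a temporary directory that is removed afterwards:

```python
import os
import subprocess
import tempfile
from urllib.request import urlopen


def pdftotext_command(pdf_path, first_page=1, last_page=1):
    """Build the same command line as above; the trailing "-" sends text to stdout."""
    return ["pdftotext", "-layout",
            "-f", str(first_page), "-l", str(last_page),
            pdf_path, "-"]


def text_via_cli(url, first_page=1, last_page=1):
    """Download `url` to a temporary file and run the pdftotext CLI on it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = os.path.join(tmp_dir, "temp.pdf")
        # Step 1: download (the equivalent of `curl -o temp.pdf <url>`).
        with urlopen(url) as response, open(pdf_path, "wb") as out:
            out.write(response.read())
        # Step 2: extract the text; check=True raises if pdftotext fails.
        result = subprocess.run(
            pdftotext_command(pdf_path, first_page, last_page),
            capture_output=True, text=True, check=True,
        )
    # The temporary directory and the PDF inside it are deleted at this point.
    return result.stdout
```

For example, text_via_cli("https://arxiv.org/pdf/2012.05439.pdf") would return the first page as a string.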