How to read a PDF from a URL using pdftotext - CodePudding

I want to read two PDF files from URLs without downloading them, and then extract the text using pdftotext. My current code only opens local files:

import pdftotext


with open("pdf_path1", "rb") as f:
    pdf = pdftotext.PDF(f)

# If it's password-protected
with open("b.pdf", "rb") as f:
    pdf = pdftotext.PDF(f, "secret")

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))

How can I resolve this error? Or is there another technique for reading a PDF from a URL?

CodePudding user response：

You can open the file directly from the URL with urllib.request and then pass the response to pdftotext as you would a local PDF:

import pdftotext
from urllib.request import urlopen

target_url = "https://arxiv.org/pdf/2012.05439.pdf"  # Change this to your URL.
file = urlopen(target_url)

pdf = pdftotext.PDF(file)  # Pass the password as a second argument if the PDF is protected.

# How many pages?
print(len(pdf))

# Iterate over all the pages
for page in pdf:
    print(page)

# Read some individual pages
print(pdf[0])
print(pdf[1])

# Read all the text into one string
print("\n\n".join(pdf))
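Since the question mentions two PDFs, here is a minimal sketch of the same idea wrapped in reusable helpers. It assumes pdftotext.PDF accepts any binary file-like object (as in the snippets above), and it reads the whole response into an io.BytesIO buffer first so the PDF data sits in memory rather than on disk. The names fetch_pdf_buffer and extract_pages are hypothetical and not part of pdftotext.

```python
import io
from urllib.request import urlopen


def fetch_pdf_buffer(url, timeout=30):
    """Fetch the bytes at `url` and return them in an in-memory buffer."""
    with urlopen(url, timeout=timeout) as response:
        return io.BytesIO(response.read())


def extract_pages(url, password=None):
    """Return a list with the text of each page of the PDF at `url`."""
    # Imported here so that fetch_pdf_buffer works even without pdftotext.
    import pdftotext

    buffer = fetch_pdf_buffer(url)
    if password is None:
        pdf = pdftotext.PDF(buffer)
    else:
        pdf = pdftotext.PDF(buffer, password)
    return list(pdf)
```

For two PDFs, calling extract_pages once per URL (for example in a loop over a list of URLs) gives one list of page strings per document. The bytes are still transferred over the network; they are just never written to a file.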

CodePudding user response：

You cannot read somebody's PDF online; it must be your own copy (ALL PDFS MUST BE DOWNLOADED). Your computer can only work with local copies of HTML pages and their contents. That's the way it was, and still is.

How the web works in just one line (more graphic methods are available):

<A HyperRef=HTextTransferProtocol://www.website.html>download to view our BBS pages</a>

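So download the PDF first, then run pdftotext on the local copy. The && makes pdftotext wait until curl has finished successfully:

curl -o temp.pdf https://arxiv.org/pdf/2012.05439.pdf && pdftotext -layout -f 1 -l 1 temp.pdf -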

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1318k  100 1318k    0     0   488k      0  0:00:02  0:00:02 --:--:--  488k
                                                                          Scheduling Beyond CPUs for HPC....
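
If you would rather drive the same download-then-convert steps from Python, the following is a rough sketch. It assumes the pdftotext command-line tool is on your PATH and reuses the -layout, -f and -l options from the command above. The function names pdftotext_command and url_to_text are hypothetical, and the temporary file name temp.pdf is just an illustration.

```python
import os
import subprocess
import tempfile
from urllib.request import urlopen


def pdftotext_command(pdf_path, first_page=1, last_page=1):
    """Build the pdftotext argument list; the final "-" sends text to stdout."""
    return [
        "pdftotext", "-layout",
        "-f", str(first_page),
        "-l", str(last_page),
        pdf_path, "-",
    ]


def url_to_text(url, first_page=1, last_page=1):
    """Download `url` to a temporary file and return the extracted text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        pdf_path = os.path.join(tmp_dir, "temp.pdf")
        # Save the download to disk so the command-line tool can read it.
        with urlopen(url) as response, open(pdf_path, "wb") as out:
            out.write(response.read())
        result = subprocess.run(
            pdftotext_command(pdf_path, first_page, last_page),
            capture_output=True, text=True, check=True,
        )
    return result.stdout
```

Calling url_to_text("https://arxiv.org/pdf/2012.05439.pdf") should return the text of the first page, and the temporary copy is deleted once the text has been captured.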