I am trying to fetch a PDF from the web using something like requests
and convert it to a string in Python. I don't want the PDF to end up on my hard disk. Instead, I want to download it and work with its contents as text (a string) in Python 3.
For example, say you have a PDF file with the contents: I love programming.
url = 'xyzzy.org/g.pdf'
re = requests.get(url)
# do something to re and assign it to `pdf`
convert_to_string(pdf) -> "I love programming"
CodePudding user response:
As pointed out in the comments, you can divide this task into two parts:
- Download the pdf through a stream object
- Convert the in-memory pdf into a string
This should do the job (it requires the PyMuPDF package, which is imported as fitz):
import io
import requests
import fitz
url = "http://.../sample.pdf"
response = requests.get(url)
pdf = io.BytesIO(response.content)
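# filetype tells PyMuPDF how to parse the in-memory stream
with fitz.open(stream=pdf, filetype="pdf") as doc:
    text = ""
    for page in doc:
        # append each page's text instead of overwriting it
        text += page.get_text()
print(text)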
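If you want something reusable, the two steps can be wrapped into small functions. This is only a sketch under a few assumptions: the names `looks_like_pdf`, `pdf_bytes_to_text` and `pdf_url_to_text` are made up for illustration, the 30-second timeout is arbitrary, and the header check is an extra guard (not part of the answer above) for the case where the server returns an HTML error page instead of a PDF.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    # PDF files begin with the marker "%PDF-" followed by a version number.
    return data.startswith(b"%PDF-")


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract the text of every page from an in-memory PDF."""
    if not looks_like_pdf(data):
        raise ValueError("the downloaded data does not look like a PDF")
    # PyMuPDF is imported here so the header check works without it installed.
    import fitz

    with fitz.open(stream=data, filetype="pdf") as doc:
        # Join the text of all pages into a single string.
        return "".join(page.get_text() for page in doc)


def pdf_url_to_text(url: str, timeout: float = 30.0) -> str:
    """Download a PDF into memory (nothing is written to disk) and return its text."""
    import requests

    response = requests.get(url, timeout=timeout)
    # Fail early on HTTP errors such as 404 instead of parsing an error page.
    response.raise_for_status()
    return pdf_bytes_to_text(response.content)
```

You would call it as `text = pdf_url_to_text(url)` with a full URL including the scheme (`http://` or `https://`). Passing `response.content` directly works because PyMuPDF accepts raw bytes as the stream, so the `io.BytesIO` wrapper is optional.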