How do I get a PDF online and convert it to a string in Python?

Time:06-23

I am trying to get a PDF online using something like requests and convert it to a string in Python. I don't want to end up with the PDF on my hard disk. Instead, I want to fetch it from the web and work with its contents as text (a string) in Python 3.

For example, say you have a PDF file with the contents: I love programming.

import requests

url = 'xyzzy.org/g.pdf'
response = requests.get(url)
# do something to `response` and assign it to `pdf`
convert_to_string(pdf)  # -> "I love programming"

CodePudding user response:

As pointed out in the comments, you can divide this task into two parts:

  1. Download the PDF into an in-memory stream object
  2. Convert the in-memory PDF into a string

This should do the job (it needs the PyMuPDF package):

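import io
import requests
import fitz  # PyMuPDF is imported under the name `fitz`

url = "http://.../sample.pdf"

# Download the PDF into memory; nothing is written to disk
response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error
pdf = io.BytesIO(response.content)

# Open the in-memory PDF and concatenate the text of every page
with fitz.open(stream=pdf, filetype="pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
print(text)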
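
If you want to be sure the download really is a PDF before handing it to a parser, here is a rough sketch of step 1 on its own. It uses only the standard library (urllib.request instead of requests, which is my substitution, not part of the answer above). The names `looks_like_pdf` and `fetch_pdf_stream` are made up for illustration, and the header check is just a heuristic.

```python
import io
import urllib.request

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with this marker


def looks_like_pdf(data: bytes) -> bool:
    """Heuristic check: does the payload start with the PDF header marker?"""
    return data.startswith(PDF_MAGIC)


def fetch_pdf_stream(url: str, timeout: float = 30.0) -> io.BytesIO:
    """Download `url` into memory and return it as a seekable byte stream.

    Nothing is written to disk. Raises ValueError if the payload does not
    look like a PDF (for example, an HTML error page).
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        raise ValueError("response does not look like a PDF")
    return io.BytesIO(data)
```

The returned stream can then be passed to the parser in place of `pdf` in the code above, e.g. `fitz.open(stream=fetch_pdf_stream(url), filetype="pdf")`.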