I am new to programming (just do it for fun sometimes) and I am having trouble using PyMuPDF.
In VS Code, it returns no errors but the output is always just an empty list.
Here is the code:
import fitz

file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"

def extract_text_from_pdf(file_path):
    # Open the pdf file
    pdf_document = fitz.open(file_path)
    # Initialize an empty list to store the text
    text = []
    # Iterate through the pages
    for page in pdf_document:
        # Extract the text from the page
        page_text = page.get_text()
        # Append the text to the list
        text.append(page_text)
    # Close the pdf document
    pdf_document.close()
    # Return the list of text
    return text

if __name__ == '__main__':
    file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"
    text = extract_text_from_pdf(file_path)
    # Show the result
    print(text)
CodePudding user response:
Based on the name of the file, I'm going to guess this was an image that was converted to a PDF. In that case, the PDF does not contain any text. It just contains an image.
If you convert a Word document to a PDF, the words in the Word document are present in the PDF, along with instructions on what font to use and where to place them. But when you convert an image to a PDF, all you have are the bytes in the image. There is no text.
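If you want to confirm that guess before reaching for OCR, here is a rough sketch that labels each page as having text, having only images, or being empty. The helper names `classify_page` and `diagnose_pdf` are made up for this example; `page.get_text()` and `page.get_images()` are the PyMuPDF calls doing the work.

```python
def classify_page(text, image_count):
    """Label a page from its extracted text and its number of embedded images."""
    if text.strip():
        return "text"
    if image_count > 0:
        return "image-only"
    return "empty"


def diagnose_pdf(file_path):
    """Return one label per page of the PDF at file_path."""
    # PyMuPDF is imported here so classify_page can be used without it.
    import fitz

    labels = []
    with fitz.open(file_path) as doc:
        for page in doc:
            text = page.get_text()
            image_count = len(page.get_images(full=True))
            labels.append(classify_page(text, image_count))
    return labels
```

If `diagnose_pdf(file_path)` reports "image-only" for every page, the file has no text layer and `get_text()` has nothing to return.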
If you really want to explore this further, what you need is an OCR package (Optical Character Recognition). There are Python packages for doing that (like pytesseract), but they can be finicky.
FOLLOWUP
PyMuPDF can do OCR itself if Tesseract is installed (look for `Page.get_textpage_ocr()`). You will need to read through the documentation for the details.
https://pymupdf.readthedocs.io/en/latest/functions.html
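As a starting point, here is a minimal sketch of an OCR fallback, not a tested recipe. It assumes a PyMuPDF version that has `Page.get_textpage_ocr()` (1.19 or later, as far as I know) and a Tesseract installation whose language data PyMuPDF can find. The function names `needs_ocr` and `extract_text_with_ocr_fallback` and the default `dpi=300` are my own choices for illustration.

```python
def needs_ocr(page_text):
    """True when normal extraction found nothing but whitespace."""
    return not page_text.strip()


def extract_text_with_ocr_fallback(file_path, language="eng", dpi=300):
    """Return a list of page texts, running OCR only on pages without a text layer."""
    # PyMuPDF is imported here so needs_ocr can be used without it.
    import fitz

    texts = []
    with fitz.open(file_path) as doc:
        for page in doc:
            page_text = page.get_text()
            if needs_ocr(page_text):
                # Render the whole page and let Tesseract recognize it.
                # This call fails if Tesseract or its language data is missing.
                textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
                page_text = page.get_text(textpage=textpage)
            texts.append(page_text)
    return texts
```

Only pages that come back empty are sent through OCR, so ordinary PDFs with a real text layer stay fast and keep their exact text.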
CodePudding user response:
One possibility is that the PDF file does not contain any text. PyMuPDF's `get_text()` reads the text that is already stored in the PDF; it does not run OCR unless you explicitly ask for it. So if the PDF is an image-only PDF, there is no text for it to extract.