Why does extracting file data in PyMuPDF give me empty lists?

I am new to programming (just do it for fun sometimes) and I am having trouble using PyMuPDF.

In VS Code, it returns no errors but the output is always just an empty list.

Here is the code:

import fitz  # PyMuPDF

def extract_text_from_pdf(file_path):
    # Open the pdf file
    pdf_document = fitz.open(file_path)
    # Initialize an empty list to store the text
    text = []
    # Iterate through the pages
    for page in pdf_document:
        # Extract the text from the page
        page_text = page.get_text()
        # Append the text to the list
        text.append(page_text)
    # Close the pdf document
    pdf_document.close()
    # Return the list of text
    return text

if __name__ == '__main__':
    file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"
    text = extract_text_from_pdf(file_path)
    # Show the result: one string per page
    print(text)

CodePudding user response:

Based on the name of the file, I'm going to guess this was an image that was converted to a PDF. In that case, the PDF does not contain any text. It just contains an image.

If you convert a Word document to a PDF, the words in the Word document are present in the PDF, along with instructions on what font to use and where to place them. But when you convert an image to a PDF, all you have are the bytes in the image. There is no text.
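One way to check whether this is what is happening is to look at each page and see whether it has a text layer at all, and how many images it carries. The following is only a rough sketch: `page_summary` and `summarise_pdf` are names made up for this example, and it assumes a reasonably recent PyMuPDF where `fitz.open()` can be used as a context manager.

```python
def page_summary(page):
    """Summarise one PyMuPDF page: is there a text layer, and how many images?"""
    text = page.get_text().strip()
    images = page.get_images(full=True)
    return {"has_text": bool(text), "image_count": len(images)}


def summarise_pdf(path):
    """Open a PDF and return one summary dict per page."""
    # Imported here so page_summary() can be used without a hard dependency
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [page_summary(page) for page in doc]
```

If every page comes back with `has_text` false and an `image_count` of at least 1, the PDF is most likely just a wrapped image, and `get_text()` has nothing to return.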

If you really want to explore this further, what you need is an OCR package (Optical Character Recognition). There are Python packages for doing that (like pytesseract), but they can be finicky.
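For what it's worth, a minimal sketch of the pytesseract route could look like the following. It assumes Pillow and pytesseract are installed and that the Tesseract binary itself is on your system. The function names `ocr_page` and `text_or_ocr`, and the 300 dpi default, are choices made for this example and not anything the libraries prescribe.

```python
import io


def ocr_page(page, dpi=300, lang="eng"):
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    # Imported lazily: these are optional extras, and pytesseract also
    # needs the Tesseract binary installed separately.
    from PIL import Image
    import pytesseract

    pix = page.get_pixmap(dpi=dpi)                      # rasterise the page
    image = Image.open(io.BytesIO(pix.tobytes("png")))  # hand it to Pillow
    return pytesseract.image_to_string(image, lang=lang)


def text_or_ocr(page, ocr=ocr_page):
    """Use the embedded text layer if there is one, otherwise fall back to OCR."""
    text = page.get_text()
    return text if text.strip() else ocr(page)
```

In the loop from the question, `page_text = text_or_ocr(page)` would replace `page_text = page.get_text()`. OCR is slow and imperfect, so it only runs for pages that have no text layer.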

FOLLOWUP

PyMuPDF can also do OCR itself, provided Tesseract is installed. The details are in the documentation:

https://pymupdf.readthedocs.io/en/latest/functions.html
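As a hedged sketch of that built-in route: recent PyMuPDF versions have `Page.get_textpage_ocr()`, which builds an OCR'd text page that can then be passed to `get_text()`. This assumes Tesseract's language data can be found (usually through the `TESSDATA_PREFIX` environment variable or the `tessdata` argument). The helper names `ocr_text_builtin` and `ocr_pdf` are made up for this example.

```python
def ocr_text_builtin(page, language="eng", dpi=300):
    """OCR one page with PyMuPDF's own Tesseract integration."""
    # full=True asks for OCR of the whole rendered page, not just its images
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)


def ocr_pdf(path, language="eng", dpi=300):
    """Return the OCR'd text of every page in the PDF as a list of strings."""
    # Imported here so ocr_text_builtin() can be used without a hard dependency
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [ocr_text_builtin(page, language, dpi) for page in doc]
```

If Tesseract or its language data cannot be found, expect the `get_textpage_ocr()` call to raise an error, not to return empty text.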

CodePudding user response:

One possibility is that the PDF file does not contain any text. By default, PyMuPDF's get_text() reads the text layer stored in the PDF and does not run OCR. So if the PDF is image-only (a scan, or an image converted to PDF), there is no text layer to read and you get an empty string for every page.