PyMuPDF - How to extract data from unstructured PDFs using PyMuPDF in Python?

I am following this guide on how to extract data from Unstructured PDFs using PyMuPDF.

https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/

When I run the code from the guide, I get the error AttributeError: 'NoneType' object has no attribute 'rect'. I am not sure what is going on, since I am new to Python.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-7f394b979351> in <module>
      1 first_annots=[]
      2 
----> 3 rec=page1.first_annot.rect
      4 
      5 rec

AttributeError: 'NoneType' object has no attribute 'rect'

---------------------------------------------------------------------------

Code

import fitz
import pandas as pd 
doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')
page1 = doc[0]
words = page1.get_text("words")
words[0]

first_annots=[]

rec=page1.first_annot.rect

rec

#Information of words in first object is stored in mywords

mywords = [w for w in words if fitz.Rect(w[:4]) in rec]

ann= make_text(mywords)

first_annots.append(ann)

def make_text(words):

    line_dict = {} 

    words.sort(key=lambda w: w[0])

    for w in words:  

        y1 = round(w[3], 1)  

        word = w[4] 

        line = line_dict.get(y1, [])  

        line.append(word)  

        line_dict[y1] = line  

    lines = list(line_dict.items())

    lines.sort()  

    return "\n".join([" ".join(line[1]) for line in lines])

print(rec)
print(first_annots)

CodePudding user response：

The first_annot property of a PyMuPDF Page object contains either the page's first annotation or None if the page has no annotations. This is where your error comes from: your page has no annotations, so page1.first_annot is None and accessing .rect on it fails. You also seem to be mixing up two things: annotations have nothing to do with a page's text, which you extract with the method Page.get_text(). With the option "words", this general extraction method returns a list of items (x0, y0, x1, y1, "word", ...). The first four subitems are the coordinates of the rectangle wrapping the text "word". If you sort by the first subitem (x0) only, the items that appear leftmost come first, independently of their vertical position. I hope this is what you actually want, although your code suggests otherwise.
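As a rough sketch of how the None case could be guarded against: the helper below only assumes a page object that exposes first_annot and get_text("words") as described above. The name words_in_first_annot is made up for illustration, and the coordinate comparison is meant to mirror the intent of the `fitz.Rect(w[:4]) in rec` test in the question, not to replace it.

```python
def words_in_first_annot(page):
    """Return the word tuples lying inside the page's first annotation.

    Returns an empty list when the page has no annotations, instead of
    raising AttributeError on None.
    """
    annot = page.first_annot          # None if the page has no annotations
    if annot is None:
        return []
    rect = annot.rect
    words = page.get_text("words")    # items: (x0, y0, x1, y1, "word", ...)
    # Keep only words whose bounding box lies fully inside the annotation rectangle.
    return [
        w for w in words
        if rect.x0 <= w[0] and rect.y0 <= w[1]
        and w[2] <= rect.x1 and w[3] <= rect.y1
    ]
```

With the document from the question, this would be called as words_in_first_annot(doc[0]); an empty result then means either no annotation or no words inside it.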

To sort in the common way (top-left to bottom-right), simply use this form of the method: page.get_text("words", sort=True).

Also be aware that words appearing to be in the same line may still have y-coordinates that differ by some minute value (indistinguishable to the eyes), so you may want to code the sorting yourself - e.g. using rounded y-coordinates, etc.
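One possible way to code that sorting yourself is sketched below. It works on plain word tuples in the (x0, y0, x1, y1, "word", ...) layout, groups them by the rounded bottom coordinate, and then orders each line left to right. The function name words_to_text and the ndigits parameter are assumptions for this example; rounding is a crude grouping and can still split a line whose coordinates straddle a rounding boundary.

```python
def words_to_text(words, ndigits=0):
    """Join word tuples (x0, y0, x1, y1, "word", ...) into lines of text."""
    lines = {}
    for w in words:
        # Round the bottom coordinate so words on the same visual line share a key.
        key = round(w[3], ndigits)
        lines.setdefault(key, []).append(w)

    text_lines = []
    for key in sorted(lines):                          # top to bottom
        row = sorted(lines[key], key=lambda w: w[0])   # left to right
        text_lines.append(" ".join(w[4] for w in row))
    return "\n".join(text_lines)
```

Unlike sorting all words by x0 first, this keeps the vertical order of the lines and only uses x0 to order words within a line.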

CodePudding user response：

The problem appears to be related to the PDF file you used. I am not sure how you obtained the exact same PDF as the one in the guide you shared.

If you saved the images from the guide and exported them to PDF, then the following two behaviours can be expected:

page1.first_annot will return None, because the bounding boxes shown in the sample images do not seem to carry over as annotations when exported to PDF. If you redraw those bounding boxes in the exported PDF, it will give you the result for the first bounding box.
Regardless of this, calling page1.get_text("words") is not going to work in this case. It will give empty results.
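To check which of these two situations applies, a small diagnostic along the following lines may help. It is only a sketch: diagnose_page and the keys of the returned dictionary are hypothetical names, and it assumes a PyMuPDF page object providing get_text("words"), first_annot and get_images().

```python
def diagnose_page(page):
    """Summarize what a page contains, to help explain empty extraction results."""
    words = page.get_text("words")    # empty list if the page has no extractable text
    images = page.get_images()        # images referenced by the page
    return {
        "word_count": len(words),
        "has_annotations": page.first_annot is not None,
        "image_count": len(images),
        # No words but at least one image suggests a scanned or image-only page.
        "looks_image_only": len(words) == 0 and len(images) > 0,
    }
```

If looks_image_only comes back True for diagnose_page(doc[0]), the page likely has no text layer to extract from.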

I would recommend trying this out with a sample PDF that you find via Google and checking the results.