I am following this guide on how to extract data from Unstructured PDFs using PyMuPDF.
https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/
When I run the code from the guide, I get AttributeError: 'NoneType' object has no attribute 'rect'. I am not sure what is going on, since I am new to Python.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-7f394b979351> in <module>
1 first_annots=[]
2
----> 3 rec=page1.first_annot.rect
4
5 rec
AttributeError: 'NoneType' object has no attribute 'rect'
---------------------------------------------------------------------------
Code
import fitz
import pandas as pd
doc = fitz.open('Mansfield--70-21009048 - ConvertToExcel.pdf')
page1 = doc[0]
words = page1.get_text("words")
words[0]
first_annots=[]
rec=page1.first_annot.rect
rec
#Information of words in first object is stored in mywords
mywords = [w for w in words if fitz.Rect(w[:4]) in rec]
ann= make_text(mywords)
first_annots.append(ann)
def make_text(words):
    line_dict = {}
    words.sort(key=lambda w: w[0])
    for w in words:
        y1 = round(w[3], 1)
        word = w[4]
        line = line_dict.get(y1, [])
        line.append(word)
        line_dict[y1] = line
    lines = list(line_dict.items())
    lines.sort()
    return "\n".join([" ".join(line[1]) for line in lines])
print(rec)
print(first_annots)
CodePudding user response:
The property first_annot of a PyMuPDF Page object either contains the first annotation or None if there are no annotations.
This is where your error comes from.
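A minimal sketch of guarding against that None before touching .rect. The helper name first_annot_rect and the stand-in objects are hypothetical and only there so the snippet runs without a PDF; a real PyMuPDF page exposes the same first_annot attribute.

```python
from types import SimpleNamespace


def first_annot_rect(page):
    """Return the rectangle of the page's first annotation, or None if there is none."""
    annot = page.first_annot  # None when the page has no annotations
    if annot is None:
        return None
    return annot.rect


# Stand-in objects (hypothetical, for illustration only).
page_without_annots = SimpleNamespace(first_annot=None)
page_with_annot = SimpleNamespace(
    first_annot=SimpleNamespace(rect=(10, 20, 110, 40))
)

print(first_annot_rect(page_without_annots))
print(first_annot_rect(page_with_annot))
```

With your own file you would pass doc[0] instead of the stand-ins and skip the word filtering whenever the helper returns None.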
But you also seem to be confused about the fact that annotations have nothing to do with a page's text, which you extract with the method Page.get_text(). Using option "words" in this generalized extraction method returns a list of items (x0, y0, x1, y1, "word", ...).
The first four subitems are the coordinates of the rectangle wrapping the text "word". If you sort by the first subitem (x0) only, then the words that appear leftmost will come first, independently of their vertical position.
I hope this is what you actually want - your code suggests otherwise.
To sort in the common way (top-left to bottom-right), simply use this form of the method: page.get_text("words", sort=True).
Also be aware that words appearing to be in the same line may still have y-coordinates that differ by some minute value (indistinguishable to the eyes), so you may want to code the sorting yourself - e.g. using rounded y-coordinates, etc.
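As a rough sketch of coding the sorting yourself, the function below groups word tuples into lines by their rounded bottom y-coordinate and orders each line from left to right. The function name words_to_lines, the ndigits parameter and the sample tuples are assumptions made for illustration; real input would come from page.get_text("words").

```python
def words_to_lines(words, ndigits=1):
    """Group word tuples into text lines using rounded bottom y-coordinates."""
    lines = {}
    for x0, y0, x1, y1, text, *rest in words:
        key = round(y1, ndigits)  # words on one visual line share this key
        lines.setdefault(key, []).append((x0, text))
    # Lines top to bottom, words within a line left to right.
    return "\n".join(
        " ".join(text for _, text in sorted(lines[key], key=lambda item: item[0]))
        for key in sorted(lines)
    )


# Made-up sample in the (x0, y0, x1, y1, "word", ...) layout described above.
sample_words = [
    (120.0, 50.02, 160.0, 62.04, "world", 0, 0, 1),
    (72.0, 50.0, 115.0, 62.01, "Hello", 0, 0, 0),
    (72.0, 70.0, 100.0, 82.0, "second", 0, 1, 0),
    (105.0, 70.0, 130.0, 82.03, "line", 0, 1, 1),
]

print(words_to_lines(sample_words))
```

Rounding is only a heuristic: two words whose y-values fall on either side of a rounding boundary would still end up in different lines, so a tolerance-based comparison may suit some documents better.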
CodePudding user response:
The problem appears to be related to the PDF file you have used. I am not sure how you obtained the exact same PDF from the guide that you have shared.
If you saved those images and exported them to PDF, then the 2 behaviours below can be expected:
- page1.first_annot will return None, as the bounding boxes in the sample images do not seem to work after exporting to PDF. If you redraw those bounding boxes in the exported PDF, it will give you the result of the first bounding box.
- Regardless of this, if you try to call page1.get_text("words"), it is not going to work in this case. It will give empty results.
I would recommend trying this out with a sample PDF that you get from Google and checking the results.