Here's the code I have been trying with the output:
import fitz
import pandas as pd
doc = fitz.open('xyz.pdf')
page1 = doc[0]
words = page1.get_text("words")
first_annots=[]
rec=page1.first_annot.rect
rec
The output I am expecting is for all text rectangles to be identified and available individually. Here's where I found the code that I am implementing: https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/
CodePudding user response:
Independent of your overall intention (to parse unstructured text): accessing the page's annotations via page.first_annot makes no sense at all.
Your exception is caused by the fact that the page has no annotations, and therefore page.first_annot is of course None, so reading .rect from it fails.
Again: whether or not there are annotations has nothing to do with the text of the page. Simply do not access page.first_annot.
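As a minimal sketch of what that could look like: the word tuples already returned by page.get_text("words") carry their own rectangles, so they can be used directly. The helper names word_rectangles, first_annot_rect and main below are made up for illustration, and 'xyz.pdf' is the file name from the question. The annotation helper is only there to show a None check in case annotations are ever needed.

```python
def word_rectangles(page):
    """Return ((x0, y0, x1, y1), word) pairs for every word on the page.

    Each item from page.get_text("words") is a tuple
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    return [(tuple(w[:4]), w[4]) for w in page.get_text("words")]


def first_annot_rect(page):
    """Return the first annotation's rectangle, or None if the page has no annotations."""
    annot = page.first_annot
    return None if annot is None else annot.rect


def main(path="xyz.pdf"):
    # PyMuPDF is imported here so the helpers above work on any page-like object.
    import fitz

    doc = fitz.open(path)
    page = doc[0]
    for coords, word in word_rectangles(page):
        # fitz.Rect wraps the four coordinates in a rectangle object.
        print(fitz.Rect(coords), word)
    print("first annotation rect:", first_annot_rect(page))
```

Calling main() prints one rectangle per word on the first page. If larger units are wanted, page.get_text("blocks") returns one rectangle per text block in the same leading (x0, y0, x1, y1) layout.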