I am trying to use Fitz to extract data from a pdf that contains text in a very unstructured format.

Time:11-09

Here's the code I have been trying with the output:

import fitz
import pandas as pd 
doc = fitz.open('xyz.pdf')
page1 = doc[0]
words = page1.get_text("words")

first_annots=[]

rec=page1.first_annot.rect

rec


Output: output of above

The output I am expecting is for all text rectangles to be identified so that each one can be accessed separately. Here's where I found the code that I am implementing: https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/

CodePudding user response:

Independent of your overall intention (parsing unstructured text): accessing the page's annotations via page.first_annot makes no sense at all.

Your exception is caused by the fact that the page has no annotations, so page.first_annot is None, and reading .rect from None raises an AttributeError.

Again: whether or not there are annotations has nothing to do with the text of the page. Simply do not access page.first_annot.
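
If what you want is the text rectangles, the word list you already extract with page1.get_text("words") carries them: each entry is a tuple (x0, y0, x1, y1, word, block_no, line_no, word_no). Here is a minimal sketch of one way to turn that list into one rectangle per text block. The helper names words_to_block_rects and extract_block_rects are made up for illustration, and grouping by block number is only an assumption about what "text rectangles" should mean for your PDF.

```python
from collections import defaultdict


def words_to_block_rects(words):
    """Group PyMuPDF word tuples by block number.

    Each word tuple is (x0, y0, x1, y1, text, block_no, line_no, word_no).
    Returns a list of ((x0, y0, x1, y1), text) pairs, one per block,
    where the rectangle is the union of the word rectangles in that block.
    """
    blocks = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, _line_no, _word_no in words:
        blocks[block_no].append((x0, y0, x1, y1, text))

    result = []
    for block_no in sorted(blocks):
        items = blocks[block_no]
        rect = (
            min(item[0] for item in items),  # leftmost x0
            min(item[1] for item in items),  # topmost y0
            max(item[2] for item in items),  # rightmost x1
            max(item[3] for item in items),  # bottommost y1
        )
        text = " ".join(item[4] for item in items)
        result.append((rect, text))
    return result


def extract_block_rects(pdf_path, page_number=0):
    """Hypothetical convenience wrapper: open a PDF and process one page."""
    import fitz  # PyMuPDF; imported here so the helper above runs without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]
        return words_to_block_rects(page.get_text("words"))
    finally:
        doc.close()
```

With your file this would be called as extract_block_rects('xyz.pdf'), and each returned pair can then be handled on its own. No annotations are involved anywhere.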
