How do you avoid text from cropped parts in PyPDF?

Time:10-14

I'm quite new to Python and I'm working on an ML project that extracts disclosures from PDFs (published annual reports). PyPDF extracts the disclosures I need for my project, but it also includes the footer text, which I want to remove. I browsed Stack Overflow and found a solution that successfully crops out the footer with PyPDF and saves the result as a new PDF. But when I run the cropped PDF through my original code, the footer text is still included in the extracted text. Is there any way to overcome this?

CodePudding user response:

I'm not sure why, after extracting the desired text, you want to save it as a new PDF and then load it again. Anyway, here is a suggestion.

After cropping the footer from the original PDF, instead of saving the result as a new PDF, save it as a Word document. The idea is that when you load a Word document in Python with the "docx2python" library, it exposes the header, footer, and body as separate properties.
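As a rough sketch of that idea, assuming you already have a .docx version of the report (the conversion step itself is not shown here): `docx2python` returns an object whose `header`, `footer`, and `body` attributes are nested lists in the shape [table][row][cell][paragraph]. The helper names `flatten_paragraphs` and `read_docx_parts` are made up for this illustration.

```python
def flatten_paragraphs(nested):
    """Flatten [table][row][cell][paragraph] nesting into a list of non-empty strings."""
    out = []
    for table in nested:
        for row in table:
            for cell in row:
                for paragraph in cell:
                    if paragraph.strip():
                        out.append(paragraph)
    return out


def read_docx_parts(path):
    """Return header, footer, and body paragraphs of a .docx file separately."""
    # Imported lazily so flatten_paragraphs works without the package installed.
    from docx2python import docx2python

    doc = docx2python(path)
    return {
        "header": flatten_paragraphs(doc.header),
        "footer": flatten_paragraphs(doc.footer),
        "body": flatten_paragraphs(doc.body),
    }
```

With this, only the `"body"` entry would be passed on to the rest of the pipeline. Note that this only separates footers that Word stores as real footers; footer text that ended up in the document body during conversion would still appear in `"body"`.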

My guess is:

1.) The newly saved Word document shouldn't have any header or footer, just the text.

2.) If the loaded Word document still has the footer, you can drop it using the same library.
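A different route that stays within PyPDF, and is not part of the suggestion above: cropping only changes the page's visible box, while the footer text remains in the page's content stream, which is why it still shows up in the extracted text. `extract_text` accepts a `visitor_text` callback, so the footer can be filtered by position during extraction. This is a minimal sketch under assumptions: the `pypdf` package (the same parameter exists in recent PyPDF2 releases), a footer that sits below `footer_top` points from the bottom of the page, and `tm[5]` as the vertical position. Depending on the PDF, the position may be carried by `cm[5]` instead, so check with a sample page. The function names are made up for this illustration.

```python
def make_footer_filter(footer_top, parts):
    """Return a visitor that keeps only text positioned above `footer_top`."""

    def visitor(text, cm, tm, font_dict, font_size):
        # tm is the text matrix; index 5 is assumed here to be the y position.
        # For some PDFs the offset lives in cm[5] instead.
        y = tm[5]
        if y > footer_top:
            parts.append(text)

    return visitor


def extract_without_footer(pdf_path, footer_top=50):
    """Extract text page by page, skipping anything at or below `footer_top`."""
    # Imported lazily so the filter above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    pages = []
    for page in reader.pages:
        parts = []
        page.extract_text(visitor_text=make_footer_filter(footer_top, parts))
        pages.append("".join(parts))
    return pages
```

The value 50 is only a placeholder; print the y values for one page first to see where the footer of your reports actually sits.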
