Parsing Multiple PDFs as a Dataframe

Time:09-28

I have multiple PDFs in a folder. How do I put the entire contents of each PDF in a single cell (say, column B) with its file name in column A? Right now, this code parses all the PDFs, but each line of a PDF is saved as a separate row in the dataframe. I need each PDF as a single row.

from pathlib import Path
import fitz
import pandas as pd

# Collect the paths of all files with a .pdf extension in the specified directory
fold = "C:/Users/talen/OneDrive/Application Development/data/ForParse/"
pdf_search = Path(fold).glob("*.pdf")

# Convert the glob generator output to a list of absolute path strings
pdf_files = [str(file.absolute()) for file in pdf_search]

pdf_txt = ""
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        
        for page in doc:
            pdf_txt += page.getText()


with open('pdf_txt.txt','w', encoding='utf-8') as f: #Converting to text file
    f.write(pdf_txt)

data=pd.read_table('pdf_txt.txt', lineterminator='\n')  #Converting text file to dataframe
print(data)

I also tried passing sep='\n', which gives me an error: ValueError: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.

CodePudding user response:

First of all, you do not need to convert the PDF to a text file. You can put the text of the PDF directly into any cell of the dataframe.

  1. Create an empty list, textStr = [], to hold the text of the PDF. While iterating over the pages of the PDF, append each page's text with textStr.append(page.get_text("text").replace('\n', ' ')).
  2. Join the items of the list into a single string: Text = ' '.join(textStr).
  3. Assign the string Text to any cell of the dataframe, for example the second row (index 1) of column B: df.at[1, 'B'] = Text.
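
The steps above can be put together roughly as follows. This is only a sketch, not tested against your files: it assumes a recent PyMuPDF where page.get_text() is available (older releases used getText()), and the helper names extract_text, collect_rows and pdfs_to_dataframe are made up for illustration. Instead of writing cells one at a time with df.at, it builds one record per PDF and creates the dataframe from the whole list, which should give the same result: file name in column A, full text in column B.

```python
from pathlib import Path


def extract_text(pdf_path):
    """Return the text of every page of one PDF as a single string."""
    import fitz  # PyMuPDF; imported here so the other helpers work without it

    with fitz.open(pdf_path) as doc:
        # Steps 1 and 2: collect each page's text without line breaks, then join
        pages = [page.get_text("text").replace("\n", " ") for page in doc]
    return " ".join(pages)


def collect_rows(folder, extract=extract_text):
    """Build one {"A": file name, "B": full text} record per PDF in folder."""
    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        rows.append({"A": pdf_path.name, "B": extract(str(pdf_path))})
    return rows


def pdfs_to_dataframe(folder):
    """Return a dataframe with one row per PDF (hypothetical helper)."""
    import pandas as pd

    # Step 3: each record becomes one row, so each PDF ends up in a single cell
    return pd.DataFrame(collect_rows(folder), columns=["A", "B"])
```

Calling pdfs_to_dataframe(fold) with the folder from the question should then return one row per PDF. The extract parameter is there only so the folder walk can be tried with a stand-in function before any real PDFs are involved.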