I'm developing a script that loops over all PDF files in a directory, extracts their text, and writes each file's text into its own cell of a CSV file. Writing the output into the cells works. However, the CSV needs to contain the header "text" so that I can merge it with another CSV, and my attempts to insert that header with csv.writer have run into difficulties.
For example, the code below successfully extracts and inserts the text from pdfs, but writes a new header for every file extracted:
import pdfplumber
import csv
import glob

pdfs = glob.glob("dir\*.pdf")
for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
            open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(['text'])  # code for inserting header
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        csv_output.writerow([' '.join(text)])
The other approach I tried also fails. I first wrote the header into the CSV and then appended the output of the loop to it. However, for some reason the formatting of the PDF output is completely disrupted, with text scattered across multiple cells instead of staying in a single cell.
pdfs = glob.glob("dir\*.pdf")

# code for writing header
file = open("pdf_output.csv", "w", newline="")
writer = csv.writer(file)
headers = ['text']
writer.writerow(headers)

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
            open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        csv_output.writerow([' '.join(text)])
Any suggestions on workarounds or better approaches for this challenge would be immensely welcome.
CodePudding user response:
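A likely cause of the scattered cells in the second attempt (a guess from the posted code, not something confirmed by the asker): the handle used for the header is never closed, so the header stays in its write buffer while the loop appends rows through other handles. When that buffer is finally flushed, it is written at offset 0 and overwrites the start of the first row, including its opening quote. The sketch below reproduces this with a temporary file; the function name and the sample text are hypothetical.

```python
import csv
import os
import tempfile


def reproduce_unclosed_header(path):
    """Write a header through a handle that stays open, then append a row."""
    header_file = open(path, "w", newline="", encoding="utf-8")
    csv.writer(header_file).writerow(["text"])  # still buffered, not on disk

    # Append one row through a second handle, as the loop in the question does.
    with open(path, "a", newline="", encoding="utf-8") as f_output:
        csv.writer(f_output).writerow(["first, page\nsecond line"])

    # The buffered header is flushed only now, at offset 0, on top of the row.
    header_file.close()

    with open(path, newline="", encoding="utf-8") as f_input:
        return list(csv.reader(f_input))


demo_path = os.path.join(tempfile.mkdtemp(), "pdf_output.csv")
rows = reproduce_unclosed_header(demo_path)
print(rows)  # the appended text no longer comes back as one intact cell
```

Closing the header handle (or using a with block) before the loop starts avoids this.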
You could open the csv first, insert your header, then iterate through your PDFs:
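import pdfplumber
import csv
import glob

pdfs = glob.glob(r"dir\*.pdf")  # raw string keeps the backslash literal

# Write the header once. "w" starts a fresh file, so re-running the script
# does not add a second header; the with block closes the file before the loop.
with open("pdf_output.csv", "w", newline="", encoding="utf-8") as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['text'])

for pf in pdfs:
    with pdfplumber.open(pf) as pdf, \
            open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        # One row per PDF, with all of its pages joined into a single cell
        csv_output.writerow([' '.join(text)])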
Or just check if it's the first iteration:
import pdfplumber
import csv
import glob

pdfs = glob.glob(r"dir\*.pdf")  # raw string keeps the backslash literal

for i, pf in enumerate(pdfs):
    with pdfplumber.open(pf) as pdf, \
            open("pdf_output.csv", "a", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        # Write the header only before the first PDF's row
        if i == 0:
            csv_output.writerow(['text'])
        text = []
        for page in pdf.pages:
            extracted_text = page.extract_text()
            if extracted_text:
                text.append(extracted_text)
        csv_output.writerow([' '.join(text)])
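Both variants above reopen the CSV for every PDF. A third option, sketched here on the assumption that overwriting pdf_output.csv on each run is acceptable, is to open the file once in "w" mode, write the header, and reuse the same writer for every row. The helper names write_text_column and read_rows and the sample strings are hypothetical. In the real script each string would be the ' '.join(text) built from page.extract_text().

```python
import csv
import os
import tempfile


def write_text_column(csv_path, texts):
    """Write a 'text' header once, then one single-cell row per document."""
    # "w" starts a fresh file, so re-running never stacks up extra headers.
    with open(csv_path, "w", newline="", encoding="utf-8") as f_output:
        csv_output = csv.writer(f_output)
        csv_output.writerow(["text"])
        for text in texts:
            csv_output.writerow([text])


def read_rows(csv_path):
    """Read the CSV back as a list of rows, to check the cell layout."""
    with open(csv_path, newline="", encoding="utf-8") as f_input:
        return list(csv.reader(f_input))


# Stand-ins for extracted PDF text, with a comma, a newline and quotes.
sample_texts = ["page one, with a comma\npage two", 'a "quoted" word']
out_path = os.path.join(tempfile.mkdtemp(), "pdf_output.csv")
write_text_column(out_path, sample_texts)
result = read_rows(out_path)
print(result)
```

Because a single handle writes everything and the with block closes it, the header always lands first and each document's text stays in one cell.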