Python & Pandas: combining multiple rows into single cell

Time:11-11

I'm writing a script that extracts text from a PDF file and inserts it as a string into a single CSV row. Using pdfplumber I can successfully extract the text, with each page's text inserted into the CSV as an individual row. However, I'm struggling to figure out how to combine those rows into a single cell. I'm trying to use Pandas' pd.concat function to combine them, but so far without success.

Here's my code:

import pdfplumber
import pandas as pd
import csv

file1 = open("pdf_texts.csv", "w", newline="")
file2 = open("pdf_text_pgs.csv", "w", newline="")
writer2 = csv.writer(file2)
headers = ['text']

with pdfplumber.open('target.pdf') as pdf:
    pdf_length = len(pdf.pages)

    writer2.writerow(headers)

    for page_number in range(0, pdf_length):
        pdf_output = pdf.pages[page_number]
        pdf_txt = pdf_output.extract_text().encode('UTF-8')
        writer2.writerow([pdf_txt])

    # this is my attempt for pd.concat
    df  = pd.read_csv("pdf_text_pgs.csv", 'r')
    df_txts = df['text']
    pdf_txt_df = pd.concat([df_txts], axis=0, ignore_index=True)
    pdf_txt_df.to_csv('pdf_texts.csv', header=False, index=False)

However, the final output fails to combine the rows, and worse yet seems to lose the final row. Any suggestions on how to approach this? All help gratefully appreciated.

CodePudding user response:

You just need to store the text from each page in a list, join the list into a single string once all pages have been read, and write that string as one row. For example:

import pdfplumber
import csv

with pdfplumber.open('target.pdf') as pdf, \
     open("pdf_text_pgs.csv", "w", newline="", encoding="utf-8") as f_output:

    csv_output = csv.writer(f_output)
    csv_output.writerow(['text'])

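    text = []  # text of each page, in page order

    for page in pdf.pages:
        extracted_text = page.extract_text()

        if extracted_text:  # skip pages with no extractable text (blank or image-only)
            text.append(extracted_text)

    # join all pages into one string and write it as a single cell
    csv_output.writerow([' '.join(text)])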
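
If you would still rather do the combining step in Pandas, here is a minimal sketch of one way it could look. The DataFrame below is a hypothetical stand-in for the per-page rows in pdf_text_pgs.csv, not real output. It also shows why pd.concat did not help: concat stacks objects on top of each other, so given a single Series it returns the same rows rather than joining the strings.

```python
import pandas as pd

# Hypothetical stand-in for the per-page text read from pdf_text_pgs.csv;
# None represents a page with no extractable text.
df = pd.DataFrame({"text": ["first page", "second page", None, "last page"]})

# pd.concat stacks objects; with one Series the rows come back unchanged.
stacked = pd.concat([df["text"]], axis=0, ignore_index=True)

# Drop the missing pages, then join the remaining strings into one value.
combined = " ".join(df["text"].dropna())

# A one-row frame, so the whole document ends up in a single cell.
single = pd.DataFrame({"text": [combined]})
```

Calling single.to_csv('pdf_texts.csv', index=False) would then write the combined text as one cell. Note also that in the question's code the page-level CSV is still open when it is read back, so the last buffered rows may not have reached disk yet, which is a likely reason the final row appeared to be lost. Closing the file (or using a with block, as above) before reading it should avoid that.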