How to get rid of '\r' when extracting and printing a table from a PDF file?

Time:04-30

The objective is to extract a table from a given PDF file and convert the whole table to a pandas DataFrame for further operations. The whole table will only contain strings.

The code itself works, but after converting the extracted table to a DataFrame, every string that originally had a line break inside its table cell appears with "\r" between the words.

Example: Original appearance in cell (with a line break between the words): "Neues Wh..."

Should look like: "Neues Wh..."

Result after converting to df: "Neues\rWh..."

See my code below:

import pandas as pd
import win32com.client
from win32com.client import Dispatch, constants
import codecs
import os
import io

import tabula
from tabula import read_pdf
from tabulate import tabulate

pdf_template_path = os.path.join(r'H:\folder\ pdf-file')
pdf_template_path1 = pdf_template_path + '.pdf'

pdf_table = read_pdf(pdf_template_path1,
                     pages = 'all', 
                     multiple_tables = True,
                     lattice= True, 
                     pandas_options={'header': None}
)

# Transform the result into a string table format
table = tabulate(pdf_table)

# Transform the table into dataframe
df = pd.read_fwf(io.StringIO(table))

# Build the column mapping only after df exists, since it reads df.columns
mapping = {df.columns[0]: 'x1',
           df.columns[1]: 'x2',
           df.columns[2]: 'x3',
           df.columns[3]: 'x4?',
           df.columns[4]: 'x5',
           df.columns[5]: 'x6',
           df.columns[6]: 'x7',
           df.columns[7]: 'x8'}

df.rename(columns= mapping, inplace= True)
# 'Beschreibung' must be the name of an existing column in df
df.style.set_properties(subset=['Beschreibung'], **{'width': '300px'})

display(df.head())
df.shape

Following result: result

As you can see in the picture, the carriage return sequence "\r" sometimes appears between the words, e.g. 'Neues\rWh..', but the result should look like this: 'Neues Wh..'.

I tried methods like replace():

df = df.replace('\r', '', regex= True)

EDIT: But it didn't work, as the strings in the df remain the same, see the result picture: result after df_replace

I'm thankful for any advice.

CodePudding user response:

Solved. The solution here is:

df = df.replace(r'\\r', ' ', regex= True)

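In the raw string r'\\r', the first backslash escapes the second one, so the regular expression matches a literal backslash followed by the letter r. The cells apparently contained those two characters rather than an actual carriage return character, which is why replacing '\r' had no effect.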
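
To illustrate the difference, here is a minimal sketch. The sample DataFrame is made up for demonstration and is not the table from the question: one cell holds the two characters backslash and r, the other holds a real carriage return. The last pattern, which matches either form, is an extra suggestion and not part of the accepted solution.

```python
import pandas as pd

# Hypothetical sample data (not the original PDF table):
# first cell  -> the two characters backslash + "r"
# second cell -> a real carriage return character
df = pd.DataFrame({"x1": ["Neues\\rWh...", "Neues\rWh..."]})

# '\r' matches only a real carriage return, so the first cell stays unchanged
only_cr = df.replace("\r", " ", regex=True)

# r'\\r' matches a literal backslash followed by "r", so only the first cell changes
only_literal = df.replace(r"\\r", " ", regex=True)

# Matching either form cleans both cells in one pass
both = df.replace(r"\\r|\r", " ", regex=True)

print(only_cr["x1"].tolist())
print(only_literal["x1"].tolist())
print(both["x1"].tolist())
```

If it is unclear which form a cell contains, printing repr() of a single value shows it: a literal backslash is displayed doubled, a real carriage return as a single '\r'.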
