The objective is to extract a table from a given PDF file and convert the whole table to a pandas DataFrame for further operations. The table contains only strings.
The code itself works, but after the extracted table is converted to a DataFrame, every string that originally had a line break inside its table cell appears with "\r" between the words.
Example: Original appearance in the cell: "Neues Wh..."
Should look like: "Neues Wh..."
Result after converting to df: "Neues\rWh..."
See my code below:
import pandas as pd
import win32com.client
from win32com.client import Dispatch, constants
import codecs
import os
import io
import tabula
from tabula import read_pdf
from tabulate import tabulate
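# Build the path to the PDF (folder and file name are placeholders)
pdf_template_path = os.path.join(r'H:\folder\ pdf-file')
pdf_template_path1 = pdf_template_path + '.pdf'
# Extract all tables from all pages; lattice mode relies on the table's ruling lines
pdf_table = read_pdf(pdf_template_path1,
                     pages='all',
                     multiple_tables=True,
                     lattice=True,
                     pandas_options={'header': None})
# Transform the result into a string table format
table = tabulate(pdf_table)
# Transform the table into dataframe
df = pd.read_fwf(io.StringIO(table))
# Build the column mapping only now, because it needs df to exist
mapping = {df.columns[0]: 'x1',
           df.columns[1]: 'x2',
           df.columns[2]: 'x3',
           df.columns[3]: 'x4?',
           df.columns[4]: 'x5',
           df.columns[5]: 'x6',
           df.columns[6]: 'x7',
           df.columns[7]: 'x8'}
df.rename(columns=mapping, inplace=True)
# Build a Styler that widens the 'Beschreibung' column; the name must exist in df,
# and the returned Styler only takes effect when it is displayed
df.style.set_properties(subset=['Beschreibung'], **{'width': '300px'})
# display() is available in Jupyter/IPython sessions
display(df.head())
df.shape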
This gives the following result (picture: result).
As the picture shows, the carriage-return sequence "\r" sometimes appears between the words, e.g. 'Neues\rWh..', but the result should look like this: 'Neues Wh..'.
I tried methods like replace():
df = df.replace('\r', '', regex= True)
EDIT: But it didn't work; the strings in the df remain the same, see the picture: result after df_replace
I'm thankful for any advice.
Answer:
Solved. The solution here is:
df = df.replace(r'\\r', ' ', regex= True)
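This works because, in the raw string r'\\r', the first backslash escapes the second, so the regular expression matches a literal backslash followed by the letter r. That the fix works suggests the cells hold the two ordinary characters '\' and 'r' rather than a real carriage-return character, which is why the pattern '\r' found nothing to replace.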
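The difference between the two patterns can be checked on a small example. The following is only a minimal sketch: the DataFrame and its column names `literal` and `real_cr` are made up for illustration and do not come from the PDF in the question. It assumes one cell holds a backslash followed by "r" (two characters) and another holds a real carriage return.

```python
import pandas as pd

# Hypothetical sample data (not from the original PDF):
# "literal" holds a backslash followed by "r" (two characters),
# "real_cr" holds one real carriage-return character.
df = pd.DataFrame({
    "literal": ["Neues\\rWh..."],
    "real_cr": ["Neues\rWh..."],
})

# repr() makes the difference visible: the literal form shows a doubled backslash.
print(repr(df.loc[0, "literal"]))
print(repr(df.loc[0, "real_cr"]))

# The pattern '\r' matches only a real carriage return,
# so the "literal" column is left unchanged.
only_cr = df.replace("\r", " ", regex=True)

# The pattern r'\\r' matches only backslash + "r",
# so the "real_cr" column is left unchanged.
only_literal = df.replace(r"\\r", " ", regex=True)

# An alternation handles both cases in one pass.
both = df.replace(r"\\r|\r", " ", regex=True)
print(both)
```

If it is unclear which form a real table contains, printing `repr()` of one affected cell, as above, shows it directly. The combined pattern is a defensive choice for data that may contain either form.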