I have a spreadsheet that is in a pdf where I extract these values and transform them into .csv with textract from aws using Python. However, when I extract the values, there are several headers and I would like to keep only the first header.
account ;description ;old balance ;debit ;credit ;mov. ;balance ; **# --> first header**
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
account ;description ;old balance ;debit ;credit ;mov. ;balance ; **# --> second header**
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 ;
;;;;debit ;credit ;mov. ;balance ; **# --> third header**
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 ;
Note that in the same .csv file I have 3 headers and one of them only has a few values, but as I want to remove it, I believe it doesn't matter so much. So how to remove the other headers? using python. Each pdf has a different header, so I believe I can use the same solution in the others
OBS: this is the way I transform the string into csv
# replace content
with open("file_name.csv", "at") as fout:
fout.write(table_csv)
I haven't tried any solutions as I can't think of anything useful
CodePudding user response:
You can use re
module to remove the duplicate headers. For example:
text = """\
account ;description ;old balance ;debit ;credit ;mov. ;balance ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
account ;description ;old balance ;debit ;credit ;mov. ;balance ;
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 ;
;;;;debit ;credit ;mov. ;balance ;
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 ;"""
import re
import pandas as pd
from io import StringIO
# remove the headers
text = re.sub(r"(?m)\n\n^.*$", "", text.strip())
# remove ; at end of lines
text = re.sub(r"(?m);\s*$", "", text.strip())
print(text)
Prints:
account ;description ;old balance ;debit ;credit ;mov. ;balance
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75
Then you can load the text to a DataFrame:
df = pd.read_csv(StringIO(text), delimiter=";")
print(df)
Prints:
account description old balance debit credit mov. balance
0 1.0 : investments 212.844.26 63.856.811,44 63.857.250.69 -439.25 212.405.01
1 1.0 : investments 212.844.26 63.856.811,44 63.857.250.69 -439.25 212.405.01
2 2.0 : investments 120.0400.0 20.000.000.0 82.840.400.0 -100.2 314.262.0
3 3.0 real state 1.000.200.4 4.000.031.47 2.273.121,44 -144.089.77 254.844.390,75
To save to CSV:
df.to_csv('file_name.csv', index=False)