how to remove multiple headers-CodePudding

I have a spreadsheet that is in a pdf where I extract these values and transform them into .csv with textract from aws using Python. However, when I extract the values, there are several headers and I would like to keep only the first header.

account ;description ;old balance ;debit ;credit ;mov. ;balance ; **# --> first header**
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;

account ;description ;old balance ;debit ;credit ;mov. ;balance ; **# --> second header**
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 ;

;;;;debit ;credit ;mov. ;balance ; **# --> third header**
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 ;

Note that in the same .csv file I have 3 headers and one of them only has a few values, but as I want to remove it, I believe it doesn't matter so much. So how to remove the other headers? using python. Each pdf has a different header, so I believe I can use the same solution in the others

OBS: this is the way I transform the string into csv

 # replace content
         with open("file_name.csv", "at") as fout:
            fout.write(table_csv)

I haven't tried any solutions as I can't think of anything useful

CodePudding user response：

You can use re module to remove the duplicate headers. For example:

text = """\
account ;description ;old balance ;debit ;credit ;mov. ;balance ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 ;

account ;description ;old balance ;debit ;credit ;mov. ;balance ;
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 ;

;;;;debit ;credit ;mov. ;balance ;
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75 ;"""

import re
import pandas as pd
from io import StringIO

# remove the headers
text = re.sub(r"(?m)\n\n^.*$", "", text.strip())

# remove ; at end of lines
text = re.sub(r"(?m);\s*$", "", text.strip())

print(text)

Prints:

account ;description ;old balance ;debit ;credit ;mov. ;balance 
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 
1.00 ;: investments ;212.844.26 ;63.856.811,44 ;63.857.250.69 ;-439.25 ;212.405.01 
2.00 ;: investments ;120.0400.0 ;20.000.000.0 ;82.840.400.0 ;-100.2 ;314.262.0 
3.00 ;real state ;1.000.200.4 ;4.000.031.47 ;2.273.121,44 ;-144.089.77 ;254.844.390,75

Then you can load the text to a DataFrame:

df = pd.read_csv(StringIO(text), delimiter=";")
print(df)

Prints:

   account     description   old balance           debit          credit          mov.          balance 
0       1.0  : investments    212.844.26   63.856.811,44   63.857.250.69       -439.25       212.405.01 
1       1.0  : investments    212.844.26   63.856.811,44   63.857.250.69       -439.25       212.405.01 
2       2.0  : investments    120.0400.0    20.000.000.0    82.840.400.0        -100.2        314.262.0 
3       3.0     real state   1.000.200.4    4.000.031.47    2.273.121,44   -144.089.77   254.844.390,75

To save to CSV:

df.to_csv('file_name.csv', index=False)