Text heavy PDF/HTML into a standardized and specified Excel format displaying extracted and tagged i

I have been struggling for quite some time to pull data from a PDF into a specifically formatted table structure. As an example, I used the following document: https://eur-lex.europa.eu/resource.html?uri=cellar:e0649735-a372-11eb-9585-01aa75ed71a1.0001.02/DOC_1&format=PDF (link to download)

What I would like to happen: an Excel table is created whose header row shows: Title Section Subtitle Section Paragraph

And it would show the same type of information for every paragraph.

WHERE I AM

Unfortunately, I was not able to make progress in reading the PDF, so I tried the website instead (not ideal for my table), where the same information is available except for the page numbers. I was able to use Beautiful Soup on the tags and hard-code a series of if statements to output tagged content. For some reason, I thought that tagging each paragraph into a formatted text file would make it easier to then pull the information into an Excel sheet.
However, there appears to be a duplication issue: several of the sections and sub-sections are repeated, and as you can see from my code, my static approach is very rudimentary. At this stage, I'm less concerned about speed and efficiency than about getting it working properly.
From the various coding books and sites I have searched for sample code, it seems that nltk or regex could handle the document's structural changes and achieve what I have done in a much more condensed fashion, and that writing each entire paragraph to an Excel sheet in the format mentioned is feasible. If so, I have no idea how it would look or work, but I would be open to productive guidance. Currently, I do not have the Excel output working properly either: the header row is correct, but beyond that I have not been able to get anything further to work.

I'm asking for help because at this stage I am running in circles and could use expert guidance.

Sincerely, Newbie

Code:

    import re
    import bs4
    from bs4 import BeautifulSoup as bs

    # text2 holds the HTML of the document's web page, with blank spaces and
    # newlines removed using https://24toolbox.com/newline-remover/
    # Only a shortened version is given below, as there is not enough space for
    # the full string. Without the full string the code is not very useful, so
    # I suggest pulling it from the web version. (If you start from the PDF
    # document instead, I don't think the code will work at all.)

    #string

    text2 ='''<!DOCTYPE html> <html lang="en"  xml:lang="en"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible"  content="IE=edge"> <meta name="viewport" content="width=device-width,initial-scale=1"> <script src="https://ec.europa.eu/wel/cookie-consent/consent.js" type="text/javascript"></script> <script type="text/javascript" src="./../../../revamp/components/vendor/modernizr/modernizr.js?v=2.10.4"></script> <title>EUR-Lex - 52021PC0206 - EN - EUR-Lex</title> <meta name="WT.z_docTitle" content="Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE (ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION LEGISLATIVE ACTS">  <p > </p> </body> </html>'''
    soup=bs(text2, 'html.parser')
    
    Title = "Proposal for a REGULATION OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL LAYING DOWN HARMONISED RULES ON ARTIFICIAL INTELLIGENCE (ARTIFICIAL INTELLIGENCE ACT) AND AMENDING CERTAIN UNION LEGISLATIVE ACTS"

    full_text = ''
    paragraphs = soup.findAll('span')

    # THEN HAVE THE FILE WRITE TO AN EXCEL SHEET

    import xlwt
    from xlwt import Workbook
    wb = Workbook()

    # HEADERS IN SPREADSHEET
    sheet1 = wb.add_sheet('Mapping')
    sheet1.write(0, 0, 'Page#')
    sheet1.write(0, 1, 'Title')
    sheet1.write(0, 2, 'Section Sub-section')
    sheet1.write(0, 3, 'Section')
    sheet1.write(0, 4, 'Paragraph')


    # Tag each numbered heading with its page range; the + operators that were
    # missing between the string pieces are restored here.
    for p in paragraphs:
        label = p.text.strip()  # e.g. '1.', '1.1.' or '5.2.3.'
        if label == '1.': text = p.parent.text + "<Page 1-5><Sub-section:" + label + ">\n"
        elif label == '1.1.': text = "\n" + p.parent.text + "\n</Sub-Section>\n<Page 1 - 3><Sub-section:" + label + ">\n"
        elif label == '1.2.': text = "\n" + p.parent.text + "\n</Sub-Section>\n<Page 4><Sub-section:" + label + ">\n"
        elif label == '1.3.': text = "\n" + p.parent.text + "\n</Sub-Section>\n<Page 5><Sub-section:" + label + ">\n"
        elif label == '2.': text = "\n" + p.parent.text + "\n</Sub-Section>\n<Page 6-7><Sub-section:" + label + ">\n"
        elif label == '2.1.': text = "\n</Sub-Section><Page 6><Sub-section:" + label + ">\n"
        elif label == '2.2.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 6><Sub-section:" + label + ">\n"
        elif label == '2.3.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 7><Sub-section:" + label + ">\n"
        elif label == '2.4.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 7><Sub-section:" + label + ">\n"
        elif label == '3.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 7-11><Sub-section:" + label + ">\n"
        elif label == '3.1.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 7-8><Sub-section:" + label + ">\n"
        elif label == '3.2.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 8-9><Sub-section:" + label + ">\n"
        elif label == '3.3.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 9-10><Sub-section:" + label + ">\n"
        elif label == '3.4.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 10-11><Sub-section:" + label + ">\n"
        elif label == '4.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 11-12><Sub-section:" + label + ">\n"
        elif label == '5.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 12-<Sub-section:" + label + ">\n"  # end page not filled in
        elif label == '5.1.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 12><Sub-section:" + label + ">\n"
        elif label == '5.2.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 12><Sub-section:" + label + ">\n"
        elif label == '5.2.1.': text = "\n</Sub-Section><Page 12><Sub-section:" + label + ">\n"
        elif label == '5.2.2.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 12-13><Sub-section:" + label + ">\n"
        elif label == '5.2.3.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 13-14><Sub-section:" + label + ">\n"
        elif label == '5.2.4.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 14-15><Sub-section:" + label + ">\n"
        elif label == '5.2.5.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 15><Sub-section:" + label + ">\n"
        elif label == '5.2.6.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 15><Sub-section:" + label + ">\n"
        elif label == '5.2.7.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 16><Sub-section:" + label + ">\n"
        elif label == '5.2.8.': text = "\n" + p.parent.text + "\n</Sub-Section><Page 16>\n<Sub-section:" + label + ">\n"
        else: text = p.text.replace(',', '').replace('"', '').replace("'", "").replace('?', '').replace("\n", "").replace('\r', '')
        full_text += ' ' + text + ' '

    # write the tagged text out to a file; the with-block closes it afterwards
    filename = "Fileprint.txt"
    with open(filename, 'w') as file_object:
        file_object.write(full_text)

I appreciate any help or guidance that may be provided. It would be great to understand whether a working version is even possible and, if so, what steps I should take to get there. Thank you for taking the time to read this question and for anything you may be able to do to help me get over the wall.

CodePudding user response：

"Text-heavy" and "VBA" sound like unnecessary difficulty. How about using Julia, Python, etc. to create the table? If the end result needs to be in Excel, you can publish it by generating a CSV file for Excel to read.
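
A minimal sketch of that CSV route in Python, assuming the paragraphs have already been extracted into one tuple per paragraph. The column names are adapted from the header row in the question; the `rows_to_csv` helper and the sample row are hypothetical and only show the shape of the data.

```python
import csv
import io

# Column names adapted from the spreadsheet header in the question.
HEADER = ["Page#", "Title", "Section", "Sub-section", "Paragraph"]


def rows_to_csv(rows, stream):
    """Write the header row and one line per paragraph to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Hypothetical sample row, not real extracted data.
sample_rows = [
    ("1-3", "Example title", "1.", "1.1.", "First paragraph, with a comma."),
]

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
rows_to_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To produce a file instead, pass a stream opened with `open("mapping.csv", "w", newline="", encoding="utf-8-sig")` (the file name is an assumption). The `csv` module quotes fields that contain commas or quotation marks, so the paragraph text should not need to have its punctuation stripped first.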

CodePudding user response：

Use the Camelot-py module to read the PDF; it takes only a few lines of code and its documentation is easy to follow. It can read PDF files using PyPDF2 and extract the tables present in them into a data frame. Regex is also not as hard as you think; the regex101 site makes it easier to build a regex. I am working on a similar project as well. https://camelot-py.readthedocs.io/en/master/
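
For the regex part, here is a minimal sketch, assuming the document text has already been extracted into plain lines and that headings start with a dotted number such as `1.`, `1.1.` or `5.2.3.`, as in the question's code. The `HEADING` pattern, the `tag_paragraphs` helper and the sample lines are hypothetical; they are not part of Camelot and are not taken from the EUR-Lex page.

```python
import re

# A heading line starts with a dotted number such as "1.", "1.1." or "5.2.3.",
# followed by whitespace and the heading text.
HEADING = re.compile(r"^(?P<number>\d+(?:\.\d+)*\.)\s+(?P<title>.+)$")


def tag_paragraphs(lines):
    """Return one (section, sub_section, paragraph) tuple per non-heading line."""
    rows = []
    section = sub_section = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = HEADING.match(line)
        if match:
            number = match.group("number")
            if number.count(".") == 1:
                # Top-level heading such as "1.": start a new section.
                section, sub_section = number, ""
            else:
                # Nested heading such as "1.1." or "5.2.3.".
                sub_section = number
            continue
        rows.append((section, sub_section, line))
    return rows


# Hypothetical sample input, not real document text.
sample = [
    "1. FIRST SECTION",
    "1.1. First sub-section",
    "First paragraph of 1.1.",
    "Second paragraph of 1.1.",
    "2. SECOND SECTION",
    "A paragraph directly under section 2.",
]

rows = tag_paragraphs(sample)
for row in rows:
    print(row)
```

This replaces the long chain of if statements with a single pattern that remembers the most recent section and sub-section. A body paragraph that happens to begin with a number and a full stop would be mistaken for a heading, so the pattern may need tightening against the real document.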