Power query's Data from PDF not always reliable, possible to Iterate over url links to download-CodePudding

Kind of a 2 part question depending on whether or not the first is possible to change. I suspect it is not so please skip to problem 2:

Problem 1:

So with Power Query data sources, you can get data from both PDF files as well as PDFs from URLs. Therefore, it is possible to generate a function that can iterate over either a file or a list of URLs to perform data transformations to each. In my case I am interested in separating the text into its constituent sentences where paragraphs of text are present. The latter has been achieved. However, there is a particular issue when extracting data from certain PDFs.

For example :

So my first question is, does anyone know why this happens and if so if anything can be done to prevent this?

I have checked the PDF for hidden characters etc, but these do not appear to be present. Next I have checked to see if using some online text extraction tool delivers the same result.

Simply enter the same URL here

Produces a much more reliable result. So it makes sense if the first problem cannot be resolved to download all the PDFs from URL links in bulk and from this, convert these to text files and finally connect these to power query. I suspect this, although roundabout way, will produce much more reliable results.

Problem2/Question: How can I write a script that iterates over a list of Urls, Download each of the PDF files, and converts these to a text file?

The plan is to have a separate excel based tool to achieve this and then throw the raw text data into PQ for the extraction.

I will be posting progress made here, but if anyone can offer any info to either the first question or how to achieve this second question in VBA that would be great.

Progress: TBC

CodePudding user response：

Problem 1:

The function to convert a pdf is named PDF.Tables.

Returns any tables found in pdf.

The text is garbled because it is struggling to find any tables although it might be worth feeding this back to MS.

I would abandon this route as PQ is powerful but limited in these circumstances.

Problem 2:

Use Power Automate Desktop. Here is a good tutorial and it should allow you to do all the web scraping and conversion. https://www.youtube.com/watch?v=DgBZiBIgh3w

CodePudding user response：

You can use Word to convert the PDF's.

For example:

Sub Tester()

    Dim oWd As Object, oDoc As Object, i As Long, c As Range
    
    Set oWd = CreateObject("word.application")
    oWd.Visible = True
    
    For Each c In Range("A1:A5").Cells
        With oWd.Documents.Open(c.Value)
            c.Offset(0, 1).Value = .Range.Text
            'or write to a file...
            .Close
        End With
    Next c
    
    oWd.Quit
    
End Sub