I have the following Code in VBA following an answer to my
URLs:
https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6
So firstly, I think it's likely that the issue with the last URL can be solved. However, I think I also need to introduce some error handling that generates a blank text file and moves on to the next but im not sure how to achieve this.
I am not very competent with VBA, I have specified to repeat for rows B2:B200, but ideally, it would be good if, regardless of how many URLs, it just works to that number.
Also not sure if the logic in my code is particularly robust/if there is a better way to extract text from a URL.
The expected output is:
And here is an example of the generated Text file.
CodePudding user response:
Seems like that document at that problem URL contained some characters which couldn't be written to a non-unicode text file.
See comments inline:
Sub Tester()
Dim filePath As String
Dim fso As FileSystemObject, url
Dim fileStream As TextStream, ws As Worksheet
Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
Set fso = New FileSystemObject
Set oWd = CreateObject("word.application")
Set ws = Worksheets("Data") 'use a specific worksheet reference
fileRoot = ws.Range("D2").Value 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
url = Trim(c.Value)
If LCase(url) Like "http?:*" Then 'has a URL
Set oDoc = Nothing
On Error Resume Next 'ignore error if no document...
Set oDoc = oWd.Documents.Open(url)
On Error GoTo 0 'stop ignoring errors
If Not oDoc Is Nothing Then
filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
Debug.Print filePath
'open text stream as unicode
Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
fileStream.Write oDoc.Range.Text
fileStream.Close
oDoc.Close
c.Interior.Color = vbGreen 'flag OK
Else
c.Interior.Color = vbRed 'flag problem
End If
End If 'have url
Next c
oWd.Quit
End Sub
CodePudding user response:
using excel notepad or any text means your comfortable with such as export columns as csv build a cmd file that can have extra constructs like not show during runtime, but personally like to see confirmed progress NOTE using call
the second parameter must be "quoted".
download2txt.cmd
call URLpdf2txt Name1 "https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7"
call URLpdf2txt Name2 "https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725"
call URLpdf2txt Name3 "https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b"
call URLpdf2txt Name4 "https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4"
call URLpdf2txt Name5 "https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd"
call URLpdf2txt Name6 "https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683"
call URLpdf2txt Name7 "https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea"
call URLpdf2txt Name8 "https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b"
call URLpdf2txt Name9 "https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16"
call URLpdf2txt Name10 "https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6"
in comment you mention pdftotext is not installed as a native command so first step would be to ensure a local copy so a helper script as specified URLpdf2txt
would be
URLpdf2txt.cmd
@echo off
if not exist xpdf-tools-win-4.04/bin32/pdftotext.exe curl -o %temp%\xpdftools.zip https://dl.xpdfreader.com/xpdf-tools-win-4.04.zip && tar -m -xf %temp%\xpdftools.zip xpdf-tools-win-4.04/bin32/pdftotext.exe
curl -o "%~dpn1.pdf" %2
"xpdf-tools-win-4.04/bin32/pdftotext.exe" -nopgbrk -layout -enc UTF-8 "%~dpn1.pdf" "%~dpn1.txt"