Home > Blockchain >  Problems While Scraping with VBA - VBA Web Scraping
Problems While Scraping with VBA - VBA Web Scraping

Time:12-02

I'm trying to get a number from a static page in a website, but when I do the HTML request, the result is a strange HTML without the informations of the original html that I want.

The website that I'm trying to get the information is: https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true but I can get the same result with: https://fnet.bmfbovespa.com.br/fnet/publico/visualizarDocumento?id=233361&cvm=true

The number that I want to get is the number "0,05" in the page

My code is:

Sub trying()
Dim html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
    .Open "GET", "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true&", False
    .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT" 'to deal with potential caching
    .send
    html.body.innerHTML = .responseText
End With
Set element = html.getElementsByTagName("td")(31).innerText
Sheets("Sheet1").Cells(1, 1) = element    
End Sub

I have also tried to do that using the InternetExplorer.Application but the the problem keeps the same

CodePudding user response:

After trying a bunch of request headers, Accept request header is required to return the response in HTML:

Sub trying()
    Dim html As HTMLDocument
    Set html = New HTMLDocument
        
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento?id=233361&cvm=true&", False
        .setRequestHeader "Accept", "text/html,application/xhtml xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
        .send
        html.body.innerHTML = .responseText
    End With
    
    Dim element As String
    element = html.getElementsByTagName("td")(32).innerText
    Sheets("Sheet1").Cells(1, 1) = element
End Sub

CodePudding user response:

Just add a .htm (or .html) extension to the request to specify file type wanted.

Option Explicit

Public Sub trying()
    Dim html As MSHTML.HTMLDocument
    Set html = New MSHTML.HTMLDocument
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://fnet.bmfbovespa.com.br/fnet/publico/exibirDocumento.htm?id=233361", False
        .send
        html.body.innerHTML = .responseText
    End With
    Debug.Print html.querySelector("tr:nth-child(6) .dado-valores").innerText
End Sub

Alternatives, that are a little more effort, include adding the accept header or base64 decoding the .responseText before writing to .innerHTML.

If you went down the less preferable base64 decode route, using function from here (note: pretty sure this is not the original source), then you will need to amend the following line:

 .DataType = "bin.base64": .Text = Replace$(b64, Chr$(34), vbNullString) 'modified line

Public Function DecodeBase64(b64$)
    Dim b
    With CreateObject("Microsoft.XMLDOM").createElement("b64")
        .DataType = "bin.base64": .Text = Replace$(b64, Chr$(34), vbNullString) 'modified line
        b = .nodeTypedValue
        With CreateObject("ADODB.Stream")
            .Open: .Type = 1: .Write b: .Position = 0: .Type = 2: .Charset = "utf-8"
            DecodeBase64 = .ReadText
            .Close
        End With
    End With
End Function
  • Related