Home > database >  Scraping data from website with dynamic array function in vba
Scraping data from website with dynamic array function in vba

Time:12-10

I'd like to know more about how array function is applied when scrape data from website. I'm currently using this vba to copy data from website. The code can scrape the data I want, however when it comes to copy data to the destination worksheet it copies all data to A1 cell. Since this vba was developed for my previous project and works fine I'm not sure which part went wrong.

enter image description here

Sub CopyFromHKAB()
    Dim ie As Object, btnmore As Object, tbl As Object
    Dim rr As Object, cc As Object, r As Integer, c As Integer, i As Integer, j As Integer
    
    ThisWorkbook.Sheets("data").UsedRange.Clear
    
    Set ie = CreateObject("internetexplorer.application")
    With ie
        .Visible = True
        .navigate "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0"
        
        Do
            DoEvents
        Loop While .readyState <> 4 Or .Busy
          
        
        Set tbl = .document.getElementsByClassName("etxtmed")(2)
            
    End With
    
    'get data from table
    r = tbl.Rows.Length - 1
    c = tbl.Rows(0).Cells.Length - 1
    
    ReDim arr(0 To r, 0 To c)
    
    Set rr = tbl.Rows
    For i = 0 To r
        Set cc = rr(i).Cells
        For j = 0 To c
            arr(i, j) = cc(j).innertext
        Next
    
    Next
    
    ie.Quit
  
    Application.ScreenUpdating = False
    ThisWorkbook.Sheets("Sheet1").Cells(1, 1).Resize(r   1, c   1) = arr
    
    With ThisWorkbook.Sheets("data")
        .UsedRange.WrapText = False
        .Columns.AutoFit
    End With
    
End Sub

CodePudding user response:

You need to pick up the right table given they are nested so change the index to 3. Then you need to adjust your code to skip the first row. Otherwise, you are picking up the shared parent and thus all the listings are in fact within the one child element hence your current output.

N.B. You don't actually need IE for this. And you are writing out data to a different sheet than the one you end format.

Sub CopyFromHKAB()
    Dim ie As Object, btnmore As Object, tbl As Object
    Dim rr As Object, cc As Object, r As Integer, c As Integer, i As Integer, j As Integer
    
    ThisWorkbook.Sheets("data").UsedRange.Clear
    
    Set ie = CreateObject("internetexplorer.application")
    With ie
        .Visible = True
        .navigate "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0"
        
        Do
            DoEvents
        Loop While .readyState <> 4 Or .Busy
          
        
        Set tbl = .document.getElementsByClassName("etxtmed")(3)
                
    End With
    
    'get data from table
    r = tbl.Rows.Length - 1
    c = tbl.Rows(1).Cells.Length - 1
    
    ReDim arr(0 To r, 0 To c)

    Set rr = tbl.Rows

    For i = 1 To r
    
        Set cc = rr(i).Cells
        For j = 0 To c
            arr(i - 1, j) = cc(j).innertext
        Next
    
    Next
    
    ie.Quit
  
    'Application.ScreenUpdating = False
    ThisWorkbook.Sheets("Sheet1").Cells(1, 1).Resize(r   1, c   1) = arr

    With ThisWorkbook.Worksheets("data")
        .UsedRange.WrapText = False
        .Columns.AutoFit
    End With
    
End Sub

I would consider switching to XHR to avoid overhead of browser, and using querySelectorAll to allow for using a css selector list to target only the nodes of interest

Option Explicit

Public Sub GetHKABInfo()
    'tools > references > Microsoft HTML Object Library
    Dim html As MSHTML.HTMLDocument, xhr As Object
     
    Set xhr = CreateObject("MSXML2.XMLHTTP")
    Set html = New MSHTML.HTMLDocument
     
    With xhr
        .Open "GET", "https://www.hkab.org.hk/DisplayMemberAction.do?sectionid=4&subsectionid=0", False
        .setRequestHeader "User-Agent", "Safari/537.36"
        .send
        html.body.innerHTML = .responseText
    End With
    
    Dim arr() As Variant, nodes As MSHTML.IHTMLDOMChildrenCollection, i As Long
    
    Set nodes = html.querySelectorAll(".etxtmed .etxtmed td")
    
    ReDim arr(1 To nodes.Length - 1)
    
    For i = LBound(arr) To UBound(arr)
        arr(i) = nodes.Item(i).innertext
    Next
    
    ThisWorkbook.Worksheets("Sheet1").Cells(1, 1).Resize(UBound(arr), 1) = Application.Transpose(arr)
    
End Sub
  • Related