Can't write all the results from three links instead of the results from the last link only


I've written a macro to scrape some fields from three pages of a website. I used an Array() to store the results and write them to the sheet in one go, to make the execution a little faster.

The script works fine when it deals with a single page. However, things go wrong when I loop over the three links from the list: the script overwrites the previous results on each pass. I'm supposed to get 150 results after the execution; instead, I only get the 50 results from the last link.

This is what I've written so far:

Public Sub FetchData()
    Dim Xhr As Object, Html As HTMLDocument, Ws As Worksheet
    Dim Link As Variant, Links As Variant, LeadInfo() As String
    Dim I&, HtmlDoc As HTMLDocument, Listings As Object, Headers()
    Dim URLS(), N As Variant
    
    Links = Array( _
        "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50", _
        "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50", _
        "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50" _
    )
    
    Set Ws = ThisWorkbook.Worksheets("Sheet1")
    Set Xhr = CreateObject("MSXML2.XMLHTTP")
    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument

    Headers = Array("Title", "URL", "User", "Asked")

    For Each Link In Links
        With Xhr
            .Open "GET", Link, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
            .send
            Html.body.innerHTML = .responseText
        End With

        Set Listings = Html.querySelectorAll(".summary")
        ReDim LeadInfo(1 To Listings.Length, 1 To 4)

        On Error Resume Next
        For I = 0 To Listings.Length - 1
            HtmlDoc.body.innerHTML = Listings.item(I).innerHTML
            LeadInfo(I + 1, 1) = HtmlDoc.querySelector(".question-hyperlink").innerText
            LeadInfo(I + 1, 2) = HtmlDoc.querySelector(".question-hyperlink").getAttribute("href")
            LeadInfo(I + 1, 3) = HtmlDoc.querySelector(".user-details > a").innerText
            LeadInfo(I + 1, 4) = HtmlDoc.querySelector(".user-action-time > span.relativetime").innerText
        Next I
        On Error GoTo 0

        If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) + 1) = Headers
        Ws.Cells(2, 1).Resize(UBound(LeadInfo, 1), UBound(LeadInfo, 2)) = LeadInfo
    Next Link
End Sub

How can I write all the results from the three links instead of the results from the last link alone?

CodePudding user response:

You already know the page size (i.e. the maximum results per page), the number of pages and the number of headers, so simply dimension one array large enough for all the results and write it out once. That is more efficient than repeatedly calling ReDim, which copies the array, and repeatedly writing to the sheet, which incurs I/O.

Use a variable to keep track of the row to populate in the array.

Move your writing out to outside of the loop.

Declare Listings As MSHTML.IHTMLDOMChildrenCollection so that the code serves more recent Excel versions (backwards compatible to at least 2010).
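
Note that the early-bound MSHTML types used below only compile if the VBA project holds a reference to the Microsoft HTML Object Library (Tools > References); a minimal sketch of the declarations that depend on that reference:

    ' Requires a reference to "Microsoft HTML Object Library" (Tools > References)
    ' for these early-bound declarations to compile.
    Dim Html As MSHTML.HTMLDocument
    Dim Listings As MSHTML.IHTMLDOMChildrenCollection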

Option Explicit

Public Sub FetchData()
    Dim Xhr As Object, Html As MSHTML.HTMLDocument, Ws As Worksheet
    Dim Link As Variant, Links() As Variant, LeadInfo() As String
    Dim I As Long, HtmlDoc As MSHTML.HTMLDocument, Listings As MSHTML.IHTMLDOMChildrenCollection
    Dim Headers() As Variant
    
    Links = Array( _
            "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=1&pagesize=50", _
            "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=2&pagesize=50", _
            "https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page=3&pagesize=50" _
            )
    
    Set Ws = ThisWorkbook.Worksheets("Sheet1")
    Set Xhr = CreateObject("MSXML2.XMLHTTP")
    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument

    Headers = Array("Title", "URL", "User", "Asked")

    ReDim LeadInfo(1 To (UBound(Links) + 1) * 50, 1 To UBound(Headers) + 1) 'size according to headers and page size
    
    Dim rowNumber As Long
    
    For Each Link In Links
        With Xhr
            .Open "GET", Link, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
            .send
            Html.body.innerHTML = .responseText
        End With

        Set Listings = Html.querySelectorAll(".summary")

        On Error Resume Next
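        ' Resume Next guards the lookups below: querySelector returns Nothing when a
        ' listing lacks one of these nodes, and the assignment would otherwise raise an error.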
        For I = 0 To Listings.Length - 1
            rowNumber = rowNumber + 1
            HtmlDoc.body.innerHTML = Listings.Item(I).innerHTML
            LeadInfo(rowNumber, 1) = HtmlDoc.querySelector(".question-hyperlink").innerText
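            ' Hrefs parsed in this detached HTMLDocument come back prefixed with "about:",
            ' so swap that prefix for the site root to rebuild an absolute URL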
            LeadInfo(rowNumber, 2) = Replace$(HtmlDoc.querySelector(".question-hyperlink").href, "about:", "https://stackoverflow.com")
            LeadInfo(rowNumber, 3) = HtmlDoc.querySelector(".user-details > a").innerText
            LeadInfo(rowNumber, 4) = HtmlDoc.querySelector(".user-action-time > span.relativetime").innerText
        Next I
        On Error GoTo 0

    Next Link

    If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) + 1) = Headers
    Ws.Cells(2, 1).Resize(UBound(LeadInfo, 1), UBound(LeadInfo, 2)) = LeadInfo
    
End Sub
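
One caveat, offered as a sketch rather than part of the answer above: if a page ever returns fewer than the assumed 50 listings, the pre-sized array keeps some unused rows and the final write emits them as blanks. Since rowNumber already counts the rows actually filled, the write can be limited to that count instead:

    ' Hypothetical variation: write only the rows that were actually populated,
    ' in case a page returns fewer than the assumed 50 listings.
    If rowNumber > 0 Then
        If IsEmpty(Ws.Cells(1, 1).Value) Then Ws.Cells(1, 1).Resize(1, UBound(Headers) + 1) = Headers
        Ws.Cells(2, 1).Resize(rowNumber, UBound(LeadInfo, 2)) = LeadInfo
    End If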