Home > Blockchain >  Table scraping VBA
Table scraping VBA

Time:11-25

I have got the following code, which is supposed to scrape the table and print the result. I am expecting to get them into a table

County| Shropshire Address| Adams Grammar School, High Street, Newport TF10 7BD etc

but i am getting the following:

CountyShropshire AddressAdams Grammar School, High Street, Newport TF10 7BD TypeBoys Grammar Pupils799

so no speace inside.

what am i missing?

Sub scrapeschools()



Dim ie As New SHDocVw.InternetExplorer
Dim htmldoc As MSHTML.HTMLDocument

Dim htmltable As MSHTML.htmltable

Dim tablerow As MSHTML.IHTMLElement
Dim tablecell As MSHTML.IHTMLElement
Dim tablecol As MSHTML.IHTMLElement

ie.Visible = True

ie.navigate "https://www.11plusguide.com/grammar-school-test-areas/wolverhampton-shropshire-walsall/adams-grammar-school/"

Do While ie.readyState < READYSTATE_COMPLETE Or ie.Busy
Loop

Set htmldoc = ie.document


Set htmltable = htmldoc.getElementsByTagName("table")(0)


    For Each tablerow In htmltable.Children
    
            For Each tablecell In tablerow.Children
                Debug.Print tablecell.innerText
            Next tablecell
   
    Next tablerow


End Sub

CodePudding user response:

Your elements in the loop are not of the class you are expecting (based on your variable naming). This has gone without raising an error as you have declared the elements involved in the loop generically as MSHTML.IHTMLElement.

A debugging step could be to study the relevant HTML in order to understand what is produced by htmltable.Children.

Add a couple of Debug.Print TypeName(foo) calls within your loop to see what you are actually working with:

For Each tablerow In htmltable.Children
    Debug.Print TypeName(tablerow)
    For Each tablecell In tablerow.Children
        Debug.Print TypeName(tablecell)
    Next tablecell
Next tablerow

You are initially working with a section, then you are looping table rows not table cells, hence your squished output.

It is an easy fix.

  1. Use the .Rows property to ensure you are looping table rows and
  2. Use a stronger typing in the declarations
Dim tablerow As MSHTML.IHTMLTableRow
Dim tablecell As MSHTML.IHTMLTableCell

For Each tablerow In htmltable.Rows
    For Each tablecell In tablerow.Children
        Debug.Print tablecell.innerText
    Next tablecell
Next tablerow

enter image description here


Personally, I would:

  1. See if this page could be scraped using XHR requests as this would be faster and remove the now no longer supported IE
  2. Select the table with a more specific, and hopefully more stable over time, css selector list
  3. Use the clipboard to copy paste the table element OuterHTML to Excel. This would preserve the table formatting and avoid the added complexity of nested looping
  • Related