The following is a sample of the HTML Document I am working with
<tbody>
<tr>
<td>
Some text here
<div >Some title here</div>
</td>
</tr>
</tbody>
My code is the following:
With HTMLDocument.getElementsByTagName("table")(iTable)
For Each Tr In .Rows
For Each Td In Tr.Cells
Debug.Print Td.innerText
Sheets(UserDataSheetName).Cells(iRow, iCol).Value = Td.innerText
iCol = iCol 1
Next Td
iCol = Col_Num_To_Start
iRow = iRow 1
Next Tr
End With
Currently, what happens is that Td.innerText in my code not only gets the text "Some text here", but it also saves "Some title here" into my excel spreadsheet. Is there a way I can do it such that I only save "Some text here" into the spreadsheet?
CodePudding user response:
To make the macro work with an external source such as a webpage, we’ll need Microsoft’s Internet Explore (to open the page and will remain hidden). In-addition, to read and extract contents of HTML elements, we’ll have to create few objects using a library.
Therefore, first add the Microsoft HTML Object Library reference to the application. From the top menu of your VBA editor, click Tools -> References…. In the References window, find and select Microsoft HTML Object Library and click OK.
CodePudding user response:
If you work with methods like getElementsByTagName()
, you work with the DOM (Document Object Model). The DOM builds a tree of nodes from all elements of a page. There are 12 different node types. One of them is text
. A text
node is text which stands alone without a html tag. So you can't get it's value with innertext
. You have to use textContent
. To pick the right node, you can pick a html node before and than use commands like FirstChild
and NextSibling
to reach the wanted text node.
Look here for more infos about DOM nodes:
https://www.w3schools.com/jsref/prop_node_nodetype.asp
For your example it works in this way:
Sub GetTextNode()
Dim html As String
Dim doc As Object
html = "<table><tbody><tr><td>Some text here<div class='some class here'>Some title here</div></td></tr></tbody></table>"
Set doc = CreateObject("htmlFile")
doc.body.innerhtml = html
MsgBox doc.getElementsByTagName("td")(0).FirstChild.textContent
End Sub