Unable to parse webpage by converting to XmlDocument

Time:09-16

I am trying to parse a page from https://pinvoke.net using Windows PowerShell. Normally, when I have an XML string, I can convert it to a more workable object by casting the string to the [xml] type. However, when I try to parse the following page, I get an error. The parser rejects the src attribute on line 14:

$page = ( Invoke-WebRequest https://www.pinvoke.net/default.aspx/advapi32/CreateProcessAsUser.html ).Content
$xmlPage = [xml]$page # throws an error

The error (truncated since the message looks like it includes the full page content):

Cannot convert value "XML STRING HERE" to type "System.Xml.XmlDocument".
Error: "'src' is an unexpected token. The expected token is '='. Line 14, position 15."

The line in question looks like this:

<script async src = "https://www.googletagmanager.com/gtag/js?id=UA-115015704-1" ></script>

If I copy the XML to a file, remove either that line or just the async attribute, then read the file back and attempt the conversion again, it gets further, but I keep running into additional XML errors. I removed two async attributes in total before giving up.

Why does the casting conversion with [xml] fail?

Edit:

Looks like ConvertTo-Xml converts the .NET object into an XML string. It's represented under the XmlDocument type but the most I can extract out of it is the same string. I've re-titled the question accordingly and removed the statements that ConvertTo-Xml was working correctly for me.

Answer:

A valueless (boolean) attribute such as async is valid in HTML but not in XML, where every attribute must be written with a value. The conversion to XML is therefore right to fail.

You get different results because a type cast to [xml] really tries to parse the content as XML, while ConvertTo-Xml does something completely different. Look at the result of the following command:

('<script async src = "test.js"></script>' | ConvertTo-Xml).OuterXml

Output:

<?xml version="1.0" encoding="utf-8"?><Objects><Object Type="System.String">&lt;script async src = "test.js"&gt;&lt;/script&gt;</Object></Objects>

The string simply becomes the inner text of an XML element, which is probably not what you want.

ConvertTo-Xml is designed to:

create an XML-based representation of one or more .NET objects.

It does not convert a string containing XML into XML.


Not every HTML page is well-formed XML, so you cannot rely on being able to parse an arbitrary website as XML. XHTML, however, is valid XML. In XHTML, the script tag should look like this:

<script async="async" src = "test.js"></script>

To be precise, the XML parser accepts any value for the async attribute, as long as it has one.
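To see the difference concretely, here is a minimal sketch that feeds both forms of the tag to the [xml] cast. The two sample strings and the variable names are illustrative and are not taken from the pinvoke.net page.

```powershell
# The same <script> element, once in HTML form and once in XHTML form.
$htmlForm  = '<script async src="test.js"></script>'
$xhtmlForm = '<script async="async" src="test.js"></script>'

# Every attribute has a value here, so the cast succeeds and the
# attributes become properties of the root element.
$doc = [xml]$xhtmlForm
$doc.script.async
$doc.script.src

# The valueless async attribute makes the XML parser reject this string.
try {
    $null = [xml]$htmlForm
    $htmlParsed = $true
}
catch {
    $htmlParsed = $false
    $_.Exception.Message
}
```

The message printed by the catch block should resemble the one in the question, since the parser again stops at src while expecting '='.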

In your case, I recommend parsing the page as HTML instead. In Windows PowerShell, Invoke-WebRequest already does that for you:

$html = ( Invoke-WebRequest https://www.pinvoke.net/default.aspx/advapi32/CreateProcessAsUser.html ).ParsedHtml
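As a rough sketch of what can be done with the result: ParsedHtml is an HTML document object that can be queried with DOM methods. This assumes Windows PowerShell 5.1. The property does not exist in PowerShell 6 and later, and it is not available when -UseBasicParsing is used. Querying the script tags is only an example chosen here, not something the page requires.

```powershell
# Assumes Windows PowerShell 5.1; ParsedHtml does not exist in PowerShell 6 and later.
$response = Invoke-WebRequest https://www.pinvoke.net/default.aspx/advapi32/CreateProcessAsUser.html
$html = $response.ParsedHtml

# Page title as seen by the HTML parser.
$html.title

# List the src of every external <script> element, including the async one
# that the XML parser rejected. Inline scripts have an empty src and are skipped.
$html.getElementsByTagName('script') |
    ForEach-Object { $_.src } |
    Where-Object { $_ }
```

The same getElementsByTagName call works for any other tag name, so the elements that hold the content of interest can be selected the same way.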