Which version of GetAttributeValue of the 'HTML Agility Pack' is used when calling from Po

Time:09-13

I am writing a PowerShell script to work in Windows 10. I am using the 'HTML Agility Pack' library version 1.11.43.

In this library, there is a GetAttributeValue method for HTML element nodes in four versions:

  1. public string GetAttributeValue(string name, string def)
  2. public int GetAttributeValue(string name, int def)
  3. public bool GetAttributeValue(string name, bool def)
  4. public T GetAttributeValue<T>(string name, T def)

I have written a PowerShell test script for these methods:

$libPath = "HtmlAgilityPack.1.11.43\lib\netstandard2.0\HtmlAgilityPack.dll"
Add-Type -Path $libPath
$dom = New-Object -TypeName "HtmlAgilityPack.HtmlDocument"
$dom.Load("test.html", [System.Text.Encoding]::UTF8)

foreach ($node in $dom.DocumentNode.DescendantNodes()) {
    if ("#text" -ne $node.Name) {
        $node.OuterHTML
        "    " + $node.GetAttributeValue("class", "")
        "    " + $node.GetAttributeValue("class", 0)
        "    " + $node.GetAttributeValue("class", $true)
        "    " + $node.GetAttributeValue("class", $false)
        "    " + $node.GetAttributeValue("class", $null)
    }
}

File 'test.html':

<p class="true"></p>
<p class="false"></p>
<p></p>
<p class="any other text"></p>

Test script execution result:

<p class="true"></p>
    true
    0
    True
    True
    True
<p class="false"></p>
    false
    0
    False
    False
    False
<p></p>

    0
    True
    False
    False
<p class="any other text"></p>
    any other text
    0
    True
    False
    False

I know that to get the attribute value of an HTML element, you can also use the expression $node.Attributes["class"]. I also understand what polymorphism and method overloading are, and I know what a generic method is. None of that needs to be explained.

I have three questions:

  1. When $node.GetAttributeValue("class", $null) is called from a PowerShell script, which of the four variants of the GetAttributeValue method is used?

  2. I think the fourth option works (generic method). Then why does a call with the second parameter $null work exactly the same as a call with the second parameter $false?

  3. In the C# source code, the fourth option requires the following condition to work

#if !(METRO || NETSTANDARD1_3 || NETSTANDARD1_6)

I tried the library builds for NETSTANDARD1_6 and for NETSTANDARD2_0, and the test script behaves the same way with both. But with NETSTANDARD1_6 the fourth option should be unavailable, right? In that case, which version of the GetAttributeValue method is used with NETSTANDARD1_6 when the second parameter is $null?

Answer:

tl;dr

To achieve what you unsuccessfully attempted with
$node.GetAttributeValue("class", $null), i.e., to return the attribute value as a [string] and default to $null if there is none, use:

$node.GetAttributeValue("class", [string] [NullString]::Value)

[string] $null works too, but makes "" (the empty string) rather than $null the default value.


While the overload resolution that you're seeing is surprising, you can resolve ambiguity during PowerShell's method overload resolution with casts:

$dom = [HtmlAgilityPack.HtmlDocument]::new()
$dom.LoadHtml(@'
<p class="true"></p>
<p class=42></p>
<p></p>
<p class="any other text"></p>
'@)

$nodes = $dom.DocumentNode.SelectNodes('p')

# Note the use of explicit casts (e.g., [string]) to guide overload resolution.
$nodes[0].GetAttributeValue('class', [bool] $false)
$nodes[1].GetAttributeValue('class', [int] 0)
$nodes[2].GetAttributeValue('class', [string] 'default')
$nodes[3].GetAttributeValue('class', [string] [NullString]::Value)

Output:

True
42
default
any other text

Alternatively, in PowerShell (Core) 7.3+ [1], you can call generic methods with explicit type arguments:

# PS 7.3+
# Note the generic type argument directly after the method name.
# Calls the one and only generic overload, with various types substituted for T:
#   public T GetAttributeValue<T>(string name, T def)
# Note how the 2nd argument doesn't need a cast anymore.
$nodes[0].GetAttributeValue[bool]('class', $false)
$nodes[1].GetAttributeValue[int]('class', 0)
$nodes[2].GetAttributeValue[string]('class', 'default')
$nodes[3].GetAttributeValue[string]('class', [NullString]::Value)

Note:

  • When you pass $null to a [string]-typed parameter (both in cmdlets and .NET methods), PowerShell quietly converts it to "" (the empty string). [NullString]::Value tells PowerShell to pass a true null instead, and is mostly needed for calling .NET methods whose behavior differs between null and "".

  • Therefore, if you were to call $nodes[3].GetAttributeValue('class', [string] $null) or, in PS 7.3+, $nodes[3].GetAttributeValue[string]('class', $null), you'd get "" (the empty string) as the default value if the class attribute doesn't exist.

  • By contrast, [NullString]::Value, as used in the commands above, causes a true $null value to be returned if the attribute doesn't exist; you can test for that with $null -eq ....
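
The difference between the two is easy to observe without HTML Agility Pack. The following is a minimal sketch that uses a hypothetical helper class, NullProbe (not part of any library), compiled on the fly with Add-Type; its single method reports whether it received a true null or a string:

```powershell
# NullProbe is a hypothetical helper that exists only for this illustration.
Add-Type -TypeDefinition @'
public static class NullProbe
{
    // Reports what the string-typed parameter actually received.
    public static string Describe(string s)
    {
        return s == null ? "null" : "string of length " + s.Length;
    }
}
'@

# $null is quietly converted to "" before the method sees it.
$fromNull = [NullProbe]::Describe($null)

# [NullString]::Value passes a true null.
$fromNullString = [NullProbe]::Describe([NullString]::Value)

$fromNull          # string of length 0
$fromNullString    # null
```

Because the method has only one overload, no overload resolution is involved here; the sketch isolates the $null-to-string conversion itself.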


As for your questions:

On a general note, PowerShell's overload resolution is complex, and for the ultimate source of truth you'll have to consult the source code. The following is based on the de facto behavior as of PowerShell 7.2.6 and on speculation about the logic that could be applied.

When calling $node.GetAttributeValue("class", $null) from a PowerShell script, which of the four variants of the GetAttributeValue method works?

In practice, the public bool GetAttributeValue(string name, bool def) overload is chosen. Why that one, specifically, is chosen among the available overloads is ultimately immaterial: the fundamental problem is that, to PowerShell, $null provides insufficient information about the type it stands in for, so PowerShell cannot generally be expected to select a specific overload (for that, you need a cast, as shown at the top):

  • In C# passing null to the second parameter in a non-generic call unambiguously implies the overload with the string-typed def parameter, because among the non-generic overloads, string as the type of the def parameter is the only .NET reference type, and therefore the only type that can directly accept a null argument.

  • This is not true in PowerShell, which has much more flexible, implicit type-conversion rules: from PowerShell's perspective, $null can bind to any of the types among the def parameters, because it allows $null to be converted to those types; specifically, [bool] $null yields $false, [int] $null yields 0, and - perhaps surprisingly, as discussed above - [string] $null yields "" (the empty string).

    • Thus, PowerShell is justified in selecting any one of the non-generic overloads in this case, and which one it chooses should be considered an implementation detail.
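
The conversions named above can be verified directly at the prompt. This short sketch only exercises the casts and involves no method call (the variable names are arbitrary):

```powershell
# Each cast of $null succeeds and yields that type's "empty" value.
$asBool   = [bool] $null
$asInt    = [int] $null
$asString = [string] $null

$asBool                # False
$asInt                 # 0
$asString.Length       # 0, i.e. the empty string
$null -eq $asString    # False: the result is "", not $null
```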

Curiously, however, even using [NullString]::Value doesn't make a difference, even though PowerShell should know that this special value represents a $null value for a string parameter; see GitHub issue #18072.
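
To experiment with overload selection without loading the library, one option is a stand-in class that mirrors the four parameter lists but returns the name of the overload that was bound. OverloadProbe below is hypothetical and exists only for illustration: unlike the real methods, every overload returns a string, so the results say nothing definitive about HTML Agility Pack itself, and what the $null and [NullString]::Value calls print may differ between PowerShell versions.

```powershell
# OverloadProbe is a hypothetical stand-in, not part of HTML Agility Pack.
Add-Type -TypeDefinition @'
public class OverloadProbe
{
    public string GetAttributeValue(string name, string def) { return "string overload"; }
    public string GetAttributeValue(string name, int def)    { return "int overload"; }
    public string GetAttributeValue(string name, bool def)   { return "bool overload"; }
    public string GetAttributeValue<T>(string name, T def)
    {
        return "generic overload, T = " + typeof(T).FullName;
    }
}
'@

$probe = [OverloadProbe]::new()

# Referencing a method without parentheses lists its overloads.
$overloads = $probe.GetAttributeValue.OverloadDefinitions
$overloads

# Observe which overload PowerShell binds for each kind of argument.
$probe.GetAttributeValue('class', $null)
$probe.GetAttributeValue('class', [NullString]::Value)
$probe.GetAttributeValue('class', [string] 'default')
$probe.GetAttributeValue('class', [double] 1.5)
```

The same OverloadDefinitions property works on a real node object, which is one way to check which overloads a particular build of the library actually exposes.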


I think the fourth option works (generic method). Then why does a call with the second parameter $null work exactly the same as a call with the second parameter $false?

With the generic invocation syntax available in v7.3+, the generic overload definitely works, and a $null passed as the default-value argument is converted to the type specified as the type argument (assuming PowerShell allows such a conversion; it wouldn't work with [datetime], for instance, because [datetime] $null causes an error).
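
The [datetime] caveat can be checked in isolation. This small sketch exercises only the cast, not the generic method call (the variable names are arbitrary):

```powershell
$conversionFailed = $false
try {
    # Unlike [bool], [int], and [string], [datetime] has no conversion from $null.
    $default = [datetime] $null
} catch {
    $conversionFailed = $true
    "Conversion failed: $($_.Exception.Message)"
}
$conversionFailed
```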

Even with the non-generic syntax, PowerShell does select the generic overload by inference, as the following example shows, but only when you pass an actual object rather than $null:

# Try to retrieve a non-existent attribute and provide a [double]
# default value.
# The fact that a [double] instance is returned implies that the
# generic overload was chosen.
#  -> 'System.Double'
$nodes[0].GetAttributeValue('nosuch', [double] $null).GetType().FullName

In the C# source code, the fourth option requires the following condition to work [...]

When you pass $null, the generic overload is not considered - and cannot be, in the absence of type information - so this doesn't make a difference.


[1] As of this writing, v7.3 hasn't been released yet, but preview versions are available - see the repo.
