Parsing HTML with <DIV> class to variable

Time:11-11

I am trying to parse a server monitoring page whose elements don't have any class names. The HTML looks like this:

<div style="float:left;margin-right:50px"><div>Server:VIP Owner</div><div>Server Role:ACTIVE</div><div>Server State:AVAILABLE</div><div>Network State:GY</div>

How do I parse this HTML content into variables like

$Server VIP Owner
$Server_Role Active
$Server_State Available

Since there are no class names, I am struggling to extract these values. This is what I tried:

$htmlcontent.ParsedHtml.getElementsByTagName('div') | ForEach-Object {
    New-Variable -Name $_.className -Value $_.textContent
}

CodePudding user response:

You are only showing us a very small part of the HTML, and it is very likely there are more <div> tags in there.

Without an id property or anything else that uniquely identifies the div you are after, you can use a Where-Object clause to find the part you are looking for.

Try

$div = ($htmlcontent.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>Server Name:*' }).outerText

# if you're on PowerShell version < 7.1, you need to replace the first colon on each line with an equals sign
$result = $div -replace '(?<!:.*):', '=' | ConvertFrom-StringData

# for PowerShell 7.1, you can use the `-Delimiter` parameter
#$result = $div | ConvertFrom-StringData -Delimiter ':'

The result is a Hashtable like this:

Name                           Value
----                           -----
Server Name                    VIP Owner
Server State                   AVAILABLE
Server Role                    ACTIVE
Network State                  GY
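
The conversion step can be tried without the web page. Below is a minimal sketch, assuming the outer div's text looks like the sample here-string (which stands in for .outerText). The underscore variable names at the end are only an illustration of what the question asked for; looking values up in the hashtable is usually simpler.

```powershell
# Sample text standing in for the .outerText of the matched outer <div>
$div = @'
Server Name:VIP Owner
Server Role:ACTIVE
Server State:AVAILABLE
Network State:GY
'@

# Replace only the first colon on each line, then parse the key=value lines
$result = $div -replace '(?<!:.*):', '=' | ConvertFrom-StringData

# Values are looked up by the text that preceded the colon
$result['Server Role']     # ACTIVE

# Optional: create one variable per key, with spaces replaced by underscores
foreach ($key in $result.Keys) {
    Set-Variable -Name ($key -replace '\s+', '_') -Value $result[$key]
}
$Server_State              # AVAILABLE
```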

Of course, if there are more of these in the report, you'll have to loop over divs with something like this:

$result = ($htmlcontent.ParsedHtml.getElementsByTagName('div') | Where-Object { $_.InnerHTML -like '<div>Server Name:*' }) | Foreach-Object {
    $_.outerText -replace '(?<!:.*):', '=' | ConvertFrom-StringData
}
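
To see what the loop returns, here is a small sketch using two made-up text blocks in place of the real divs (the "Standby" and "PASSIVE" values are hypothetical sample data, not from the question). Each block becomes its own Hashtable, so $result ends up as an array of Hashtables.

```powershell
# Hypothetical outerText values of two matching <div> blocks
$blocks = @(
    "Server Name:VIP Owner`nServer Role:ACTIVE"
    "Server Name:Standby`nServer Role:PASSIVE"
)

# One Hashtable is produced per block
$result = $blocks | ForEach-Object {
    $_ -replace '(?<!:.*):', '=' | ConvertFrom-StringData
}

$result.Count               # 2
$result[1]['Server Role']   # PASSIVE
```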

OK, so the original question did not show what we are actually dealing with.
Apparently, your HTML contains divs like this:

  <div>=======================================</div>
  <div>Service Name:MysqlReplica</div>
  <div>Service Status:RUNNING</div>
  <div>Remarks:Change role completed in 1 ms</div>
  <div>=======================================</div>
  <div>Service Name:OCCAS</div>
  <div>Service Status:RUNNING</div>
  <div>Remarks:Change role completed in 30280 ms</div>

To deal with blocks like that, you need a whole different approach:

# create a List object to store the results
$result  = [System.Collections.Generic.List[object]]::new()
# create a temporary ordered dictionary to build the resulting items
$svcHash = [ordered]@{}

foreach ($div in $htmlcontent.ParsedHtml.getElementsByTagName('div')) {
    switch -Regex ($div.InnerText) {
        '^=+' {
            if ($svcHash.Count) {
                # add the completed object to the list
                $result.Add([PsCustomObject]$svcHash)
                $svcHash = [ordered]@{}
            }
        }
        '^(Service .+|Remarks):' {
            # split into the property Name and its value
            $name, $value = ($_ -split ':',2).Trim() 
            $svcHash[$name] = $value 
        }
    }
}
if ($svcHash.Count) {
    # The last service block is not followed by a closing
    #   <div>=======================================</div>
    # separator, so it has not been added yet; add it to the list here
    $result.Add([PsCustomObject]$svcHash)
}

# output on screen
$result | Format-Table -AutoSize

# output to CSV file
$result | Export-Csv -Path 'X:\services.csv' -NoTypeInformation

Output on screen using the above example:

Service Name Service Status Remarks                          
------------ -------------- -------                          
MysqlReplica RUNNING        Change role completed in 1 ms    
OCCAS        RUNNING        Change role completed in 30280 ms
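
If you want to check the two switch patterns without the live page (or on a PowerShell version where ParsedHtml is not available), the same logic can be run over plain strings. This is only a sketch: the $lines array is a stand-in for the InnerText of each div, copied from the sample above.

```powershell
# Stand-in for the InnerText of each <div>, taken from the sample HTML
$lines = @(
    '======================================='
    'Service Name:MysqlReplica'
    'Service Status:RUNNING'
    'Remarks:Change role completed in 1 ms'
    '======================================='
    'Service Name:OCCAS'
    'Service Status:RUNNING'
    'Remarks:Change role completed in 30280 ms'
)

$result  = [System.Collections.Generic.List[object]]::new()
$svcHash = [ordered]@{}

foreach ($line in $lines) {
    switch -Regex ($line) {
        '^=+' {
            # a separator line closes the block collected so far
            if ($svcHash.Count) {
                $result.Add([PsCustomObject]$svcHash)
                $svcHash = [ordered]@{}
            }
        }
        '^(Service .+|Remarks):' {
            # split on the first colon only, so colons in the value survive
            $name, $value = ($_ -split ':', 2).Trim()
            $svcHash[$name] = $value
        }
    }
}
# the last block has no trailing separator
if ($svcHash.Count) { $result.Add([PsCustomObject]$svcHash) }

$result.Count                # 2
$result[1].'Service Name'    # OCCAS
```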