Extract text from html with powershell - bad pattern

Time:01-29

I want to extract this text

Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)

from this html block

<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>

I'm trying the code below, but nothing is written to output2.txt

$html = Get-Content -Path 'C:\temp\html\metalarea2.html' -Raw

$pattern = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'

$matches = Select-String -InputObject $html -Pattern $pattern -AllMatches
$result = $matches | % { $_.Matches } | % { $_.Groups[1].Value }
$result | Out-File -FilePath "C:\temp\html\output2.txt"

I don't understand where the problem lies

EDIT: SOLUTIONS

$pattern = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'

OR

$pattern = '<a id="tid-link-\d+".+?>(.+?)</a>'
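Why the first attempt matched nothing can be checked in isolation. The following is a minimal sketch, assuming the sample line from the question; the variable names are only illustrative. It tests the two suspected causes separately: the doubled backslash in `\\d`, and the double quotes expected around a span id that the HTML wraps in single quotes.

```powershell
# One sample line copied from the question (the span id uses single quotes).
$sample = @'
<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
'@

# In a single-quoted PowerShell string, '\\d+' reaches the regex engine unchanged:
# an escaped backslash followed by the letter d, not the digit class.
$doubleBackslash = '369523' -match '\\d+'
$singleBackslash = '369523' -match '\d+'

# A pattern that expects double quotes around the span id cannot match this HTML;
# \x27 is the regex escape for a single quote.
$doubleQuoted = $sample -match '<span id="tid-span-\d+">'
$singleQuoted = $sample -match '<span id=\x27tid-span-\d+\x27>'

"Double backslash matches digits: $doubleBackslash"
"Single backslash matches digits: $singleBackslash"
"Double-quoted span id matches:   $doubleQuoted"
"Single-quoted span id matches:   $singleQuoted"
```

Either problem on its own is enough to make Select-String return no matches, which would explain the empty output2.txt.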

CodePudding user response:

You can use the regular expression below to capture the plain text between HTML tags:

(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>

You can refer to this example from regex101.com: Live sample

Here is a full script example:

$html = @"
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
"@

$pattern = '(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>'
$options = [System.Text.RegularExpressions.RegexOptions]::Multiline

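# Use $found rather than $matches, which PowerShell reserves as an automatic variable
$found = [regex]::Matches($html, $pattern, $options)

# Collect the value of the named group from every match
$results = $found | ForEach-Object { $_.Groups["plaintext"].Value }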

$results
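Note that this pattern captures the text of any element, so on the sample above it also returns "Text within div". If only the topic titles are wanted, the match can be narrowed to the links whose id starts with tid-link-. This is a minimal sketch, not a general HTML parser: the group name `title` and the variable names are made up for illustration, and it assumes the link text contains no nested tags.

```powershell
$html = @'
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
'@

# Require the tid-link id, skip the remaining attributes with [^>]*,
# and capture everything up to the next '<' as the title.
$pattern = '<a id="tid-link-\d+"[^>]*>(?<title>[^<]+)</a>'

# @(...) keeps the result an array even when there is a single match.
$titles = @([regex]::Matches($html, $pattern) | ForEach-Object { $_.Groups['title'].Value })

$titles
```

The div text is skipped because the pattern only starts matching at an `<a` tag with the expected id.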

CodePudding user response:

It is generally a bad idea to peek and/or poke in structured text using regular expressions. Instead, it is better to use a proper (html) parser to manipulate your data.

To give you an example using the IHTMLDocument2 interface:

$Html = @'
<html>
    <head>
        <title>Title</title>
    </head>
    <body>
        <span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
        <span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
        <div id="something">Text within div</div>
    </body>
</html>
'@
function ParseHtml($String) {
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    } 
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    $Html
}

$Document = ParseHtml $Html
$Document.getElementsByTagName('a') |
    Where-Object { $_.id -Like 'tid-link-*' } |
    Foreach-Object { $_.innerText }

Output:

Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)