I want to extract this text:
Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)
from this HTML block:
<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
I'm trying to use this code, but nothing is written to output2.txt:
$html = Get-Content -Path 'C:\temp\html\metalarea2.html' -Raw
$pattern = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'
$matches = Select-String -InputObject $html -Pattern $pattern -AllMatches
$result = $matches | % { $_.Matches } | % { $_.Groups[1].Value }
$result | Out-File -FilePath "C:\temp\html\output2.txt"
I don't understand where the problem lies
EDIT: SOLUTIONS
$pattern = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'
OR
$pattern = '<a id="tid-link-\d+".+?>(.+?)</a>'
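For anyone comparing the two versions, here is a minimal sketch of why the first pattern finds nothing. It assumes the two sample lines from the question. The variable names ($sample, $broken, $fixed, $titles) are only illustrative. Two things differ: the HTML wraps the span id in single quotes while the first pattern expects double quotes, and inside a single-quoted PowerShell string '\\d' reaches the regex engine as an escaped backslash followed by a literal d rather than as the digit class.

```powershell
# Sample lines from the question; a single-quoted here-string keeps every quote literal.
$sample = @'
<span id='tid-span-369523'><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
'@

# First attempt: double quotes around the span id, and \\d (a literal backslash plus "d").
$broken = '<span id="tid-span-\\d+"><a id="tid-link-\\d+" href=".+?" title=".+?">(.+?)</a></span>'

# Corrected: \x27 matches the single quote, \d+ matches the digits.
$fixed = '<span id=\x27tid-span-\d+\x27><a id="tid-link-\d+" href=".+?" title=".+?">(.+?)</a></span>'

# Count the matches of the first attempt (0 for this sample).
$brokenCount = [regex]::Matches($sample, $broken).Count

# Collect capture group 1 (the link text) from every match of the corrected pattern.
$titles = [regex]::Matches($sample, $fixed) | ForEach-Object { $_.Groups[1].Value }

$brokenCount
$titles
```

Using [regex]::Matches here also avoids assigning to $matches, which PowerShell already uses as an automatic variable for the -match operator.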
CodePudding user response:
You can use the regular expression below to capture the plain text between HTML tags:
(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>
You can refer to this example from regex101.com: Live sample
Here is a full script example:
$html = @"
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
"@
$pattern = '(<[^>]*>)+(?<plaintext>[^<]+)<\/[^>]*>'
$options = [System.Text.RegularExpressions.RegexOptions]::Multiline
$matches = [regex]::Matches($html, $pattern, $options)
$results = $matches | %{ $_.Groups["plaintext"].Value }
$results
CodePudding user response:
It is generally a bad idea to peek and/or poke in structured text using regular expressions. It is better to use a proper (HTML) parser to manipulate your data.
Here is an example using the IHTMLDocument2 interface:
$Html = @'
<html>
<head>
<title>Title</title>
</head>
<body>
<span id="tid-span-369523"><a id="tid-link-369523" href="http://metalarea.org/forum/index.php?showtopic=369523" title="This topic was started: Sep 16 2022, 04:18:47">Spectrum Mortis - Bit Meseri - The Incantation (2022)</a></span>
<span id='tid-span-221568'><a id="tid-link-221568" href="http://metalarea.org/forum/index.php?showtopic=221568" title="This topic was started: Apr 11 2014, 14:31:18">Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)</a></span>
<div id="something">Text within div</div>
</body>
</html>
'@
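function ParseHtml($String) {
    # Pass the markup to the COM parser as UTF-16 (Unicode) bytes
    $Unicode = [System.Text.Encoding]::Unicode.GetBytes($String)
    $Html = New-Object -Com 'HTMLFile'
    # Use the interface-qualified write method when the object exposes it,
    # otherwise fall back to the plain write method
    if ($Html.PSObject.Methods.Name -Contains 'IHTMLDocument2_Write') {
        $Html.IHTMLDocument2_Write($Unicode)
    }
    else {
        $Html.write($Unicode)
    }
    $Html.Close()
    # Return the parsed document
    $Html
}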
$Document = ParseHtml $Html
$Document.getElementsByTagName('a') |
    Where-Object { $_.id -Like 'tid-link-*' } |
    Foreach-Object { $_.innerText }
Spectrum Mortis - Bit Meseri - The Incantation (2022)
Hate Legions - Exitus Letalis (Tota Vita Nihil Aliud Quam Ad Mortem Iter Est) (2014)