PowerShell regex multiple match per line

Time:06-04

I'm having a little trouble constructing a PowerShell -replace regex that's not too greedy.

Looking to convert occurrences of this pattern: /sites/*/*/SitePages/*/*.aspx to: /sites/*/*/SitePages/*/*.html

The problem is that a single line can contain several values to be replaced. Because -replace matches greedily, the pattern captures everything from the first "sites" to the last ".aspx" on the line, so only the last occurrence is replaced.

sample input:

<div style="padding-right: 10px"><div ><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div style="width:100%"><div role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 >Jenkins Integration with Deployment Tools</h1>

failing regex segment:

% { $_ -Replace '(sites.*SitePages.*)\.aspx' , '${1}.html' }

Suggestions?

(motivation: I am trying to convert the aspx page references to html as we've moved from hosting on SharePoint. Pages are all static, so no issues, other than converting the page extensions)

CodePudding user response:

Try:

[string]$string = "<div class='ms-wikicontent ms-rtestate-field' style='padding-right: 10px'><div class='ExternalClass8E56354CC4314DBA861E187B689F3A2B'><table id='layoutsTable' style='width:100%'><tbody><tr style='vertical-align:top'><td style='width:100%'><div class='ms-rte-layoutszone-outer' style='width:100%'><div class='ms-rte-layoutszone-inner' role='textbox' aria-haspopup='true' aria-autocomplete='both' aria-multiline='true'><a id='0::Home|Home' class='ms-wikilink' href='/sites/Team/Project/SitePages/Home.aspx'>Home</a> - <a id='1::Jenkins|Jenkins' class='ms-wikilink' href='/sites/Team/Project/SitePages/Jenkins.aspx'>Jenkins</a><h1 class='ms-rteElement-H1'>Jenkins Integration with Deployment Tools</h1>"

$string.Replace('.aspx','.html')

Or, if you want to build a regex, check out https://rubular.com/, which helps with building regular expressions.
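
One thing to keep in mind with the String.Replace call above: it is a literal, case-sensitive replacement, and it rewrites every ".aspx" on the line whether or not it sits under /sites/.../SitePages/. The sketch below illustrates the difference on a made-up line. The sample line and the scoped pattern are assumptions for illustration (the pattern expects double-quoted href values), not something from the question.

```powershell
# Hypothetical one-line sample, not taken from the question.
$line = '<a href="/sites/Team/Project/SitePages/Home.ASPX">Home</a> <a href="/other/notes.aspx">Notes</a>'

# Literal and case-sensitive: skips Home.ASPX, rewrites the unrelated notes.aspx.
$literal = $line.Replace('.aspx', '.html')

# -replace is case-insensitive by default; this pattern only touches
# paths below /sites/<a>/<b>/SitePages/ inside a double-quoted value.
$scoped = $line -replace '(/sites/[^/"]+/[^/"]+/SitePages/[^"]*?)\.aspx', '${1}.html'

$literal
$scoped
```

If every ".aspx" in the exported pages really should become ".html" and the casing is uniform, the plain String.Replace is the simplest option.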

CodePudding user response:

Daniel already showed an excellent solution using character exclusion [^/]:

$_ -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'
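
To see why this works where the pattern from the question fails, here is a small comparison on a shortened version of the sample line (the shortening is mine; the two patterns are the ones quoted in this thread). The greedy .* runs to the end of the line and backtracks only as far as the last ".aspx", so the whole line becomes a single match. The lookbehind with [^/]* cannot cross a slash, so each "aspx" is matched on its own.

```powershell
# Shortened version of the sample input: two links on one line.
$line = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a>'

# Pattern from the question: one greedy match spanning both links,
# so only the last .aspx is rewritten.
$greedy = $line -replace '(sites.*SitePages.*)\.aspx', '${1}.html'

# Character exclusion inside a lookbehind: each aspx is matched separately.
$exclude = $line -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

$greedy   # Home.aspx is left untouched, Jenkins.aspx becomes Jenkins.html
$exclude  # both links end in .html
```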

Alternatively you could use the lazy modifier ?:

$_ -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

While the latter looks cleaner, it is less performant, because it requires more backtracking.
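
Apart from speed, the two patterns are not strictly equivalent. Since . also matches "/", the lazy version accepts pages in subfolders below SitePages (as in the /sites/*/*/SitePages/*/*.aspx pattern from the question), while the [^/]* version only accepts pages directly inside SitePages. A quick check, using a made-up nested path as an assumption:

```powershell
# Hypothetical link to a page one folder below SitePages.
$nested = '<a href="/sites/Team/Project/SitePages/Docs/Setup.aspx">Setup</a>'

# [^/]* cannot cross the slash after "Docs", so nothing is replaced.
$byExclude = $nested -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

# .*? may span slashes, so the nested page is rewritten.
$byLazy = $nested -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'

$byExclude
$byLazy
```

If subfolders need to be covered with the exclusion style, one possible adjustment is to let the last segment match any character other than a quote, for example [^"]* instead of [^/]*.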

I did a little benchmark:

$text = '<div  style="padding-right: 10px"><div ><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div  style="width:100%"><div  role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home"  href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins"  href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 >Jenkins Integration with Deployment Tools</h1>'

$runs = 100000
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html' }}).TotalMilliseconds
$lazyMillis    = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html' }}).TotalMilliseconds

[PSCustomObject]@{
    RegExExclude = '{0} ms'        -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}

Output from PS 7.2:

RegExExclude RegExLazy    
------------ ---------
281 ms       350 ms (125%)

The difference is noticeable, but not that big, so you may go for readability if performance doesn't matter.


The performance difference between the two becomes even smaller when using a compiled RegEx:

$text = '<div  style="padding-right: 10px"><div ><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div  style="width:100%"><div  role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home"  href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins"  href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 >Jenkins Integration with Deployment Tools</h1>'

$runs = 100000

$rxExclude = [regex]::new( '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$rxLazy    = [regex]::new( '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )

$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxExclude.Replace( $text, 'html' ) }}).TotalMilliseconds
$lazyMillis    = (Measure-Command { foreach( $i in 1..$runs ) { $rxLazy.Replace( $text, 'html' ) }}).TotalMilliseconds

[PSCustomObject]@{
    RegExExclude = '{0} ms'        -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}

Output from PS 7.2:

RegExExclude RegExLazy
------------ ---------
160 ms       178 ms (111%)
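
For the actual conversion, the compiled regex can be built once and reused for every exported page. The following is only a rough sketch: the function name Convert-AspxLinks, the *.html file filter and the folder layout are assumptions, and encoding handling is left out. Note that a [regex] object is case-sensitive unless IgnoreCase is set, unlike the -replace operator.

```powershell
# Hypothetical helper: rewrites aspx links to html in every *.html file below $Path.
function Convert-AspxLinks {
    param(
        [Parameter(Mandatory)] [string] $Path
    )

    # Build the regex once; IgnoreCase mirrors the default behaviour of -replace.
    $rx = [regex]::new(
        '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx',
        [Text.RegularExpressions.RegexOptions]'Compiled, IgnoreCase'
    )

    Get-ChildItem -Path $Path -Filter *.html -File -Recurse | ForEach-Object {
        $content = Get-Content -LiteralPath $_.FullName -Raw
        if ([string]::IsNullOrEmpty($content)) { return }   # skip empty files

        $updated = $rx.Replace($content, 'html')

        # Only write files that actually changed.
        if ($updated -cne $content) {
            Set-Content -LiteralPath $_.FullName -Value $updated -NoNewline
        }
    }
}
```

Usage would be along the lines of Convert-AspxLinks -Path 'C:\export\pages', where the path is a placeholder. Running it against a copy of the files first is advisable, since the replacement is done in place.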