I'm having a little trouble constructing a PowerShell -replace regex that's not too greedy.
Looking to convert occurrences of this pattern: /sites/*/*/SitePages/*/*.aspx
to: /sites/*/*/SitePages/*/*.html
But I'm having an issue where there are multiple values to be replaced on the one line. -replace's greediness captures the whole line, so only the last occurrence is replaced.
sample input:
<div style="padding-right: 10px"><div ><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div style="width:100%"><div role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 >Jenkins Integration with Deployment Tools</h1>
failing regex segment:
% { $_ -Replace '(sites.*SitePages.*)\.aspx' , '${1}.html' }
Suggestions?
(motivation: I am trying to convert the aspx page references to html as we've moved from hosting on SharePoint. Pages are all static, so no issues, other than converting the page extensions)
CodePudding user response:
Try:
[string]$string = "<div class='ms-wikicontent ms-rtestate-field' style='padding-right: 10px'><div class='ExternalClass8E56354CC4314DBA861E187B689F3A2B'><table id='layoutsTable' style='width:100%'><tbody><tr style='vertical-align:top'><td style='width:100%'><div class='ms-rte-layoutszone-outer' style='width:100%'><div class='ms-rte-layoutszone-inner' role='textbox' aria-haspopup='true' aria-autocomplete='both' aria-multiline='true'><a id='0::Home|Home' class='ms-wikilink' href='/sites/Team/Project/SitePages/Home.aspx'>Home</a> - <a id='1::Jenkins|Jenkins' class='ms-wikilink' href='/sites/Team/Project/SitePages/Jenkins.aspx'>Jenkins</a><h1 class='ms-rteElement-H1'>Jenkins Integration with Deployment Tools</h1>"
$string.Replace('.aspx','.html')
Or, if you want to build a regex, check out https://rubular.com/, which helps you build and test regular expressions.
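One caveat worth checking before relying on the plain string method: String.Replace is case-sensitive and rewrites every .aspx on the line, whatever path it sits in, while the -replace operator is case-insensitive by default. A minimal sketch, assuming a made-up sample line ($sample and its paths are illustrative, not taken from the question):

```powershell
# Hypothetical sample: one SitePages link with an upper-case extension,
# plus an unrelated .aspx link outside /sites/.
$sample = '<a href="/sites/Team/Project/SitePages/Home.ASPX">Home</a> <a href="/other/Login.aspx">Login</a>'

# String.Replace is case-sensitive: .ASPX is left alone,
# while /other/Login.aspx is rewritten because no path check is involved.
$viaMethod = $sample.Replace('.aspx', '.html')

# -replace is case-insensitive by default: both extensions are rewritten.
$viaOperator = $sample -replace '\.aspx', '.html'

$viaMethod
$viaOperator
```

If every .aspx in the files should become .html regardless of path or casing, the simple -replace form is probably enough; the path-aware patterns only matter when some .aspx links must stay untouched.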
CodePudding user response:
Daniel already showed an excellent solution using the negated character class [^/]:
$_ -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'
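To see why this helps, here is a small side-by-side sketch of the greedy pattern from the question and the negated-class pattern. The $line value is a shortened, illustrative version of the sample input, and the variable names are made up for this example:

```powershell
# Shortened, illustrative version of the sample line from the question.
$line = '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a>'

# Greedy pattern from the question: a single match runs from the first "sites"
# to the last ".aspx", so only the last link is rewritten.
$greedy = $line -replace '(sites.*SitePages.*)\.aspx', '${1}.html'

# Negated character class: [^/]* cannot cross a "/", so the lookbehind is
# satisfied separately for each link and both are rewritten.
$perLink = $line -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html'

$greedy
$perLink
```

With the greedy pattern only Jenkins.aspx becomes Jenkins.html; with the negated class both links are converted.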
Alternatively, you could use the lazy modifier ?:
$_ -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html'
While the latter looks cleaner, it is less performant, because it requires more backtracking.
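Speed aside, the two patterns do not accept exactly the same input, which may matter more than the timing. Because . also matches "/", the lazy version can reach across path segments, and even across tags on the same line, while [^/]* stays inside one segment. A rough sketch with two hypothetical inputs ($nested and $mixed are invented for illustration):

```powershell
$exclude = '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx'
$lazy    = '(?<=/sites/.*?/.*?/SitePages/.*?)aspx'

# Hypothetical page one folder below SitePages.
$nested = '/sites/Team/Project/SitePages/Sub/Page.aspx'
$nestedExclude = $nested -replace $exclude, 'html'   # [^/]* cannot span "Sub/Page.", so no match
$nestedLazy    = $nested -replace $lazy, 'html'      # .*? can span "Sub/Page.", so it is rewritten

# Hypothetical line: an already converted SitePages link followed by an unrelated .aspx link.
$mixed = '<a href="/sites/Team/Project/SitePages/Home.html">Home</a> <a href="/other/Legacy.aspx">Legacy</a>'
$mixedExclude = $mixed -replace $exclude, 'html'     # /other/Legacy.aspx is left alone
$mixedLazy    = $mixed -replace $lazy, 'html'        # .*? reaches back to the earlier /SitePages/

$nestedExclude, $nestedLazy, $mixedExclude, $mixedLazy
```

So if pages can live in subfolders of SitePages, the negated-class pattern would need an extra /[^/]* segment, and if unrelated .aspx links share a line with SitePages links, the lazy pattern may convert more than intended.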
I did a little benchmark:
$text = '<div style="padding-right: 10px"><div ><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div style="width:100%"><div role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 >Jenkins Integration with Deployment Tools</h1>'
$runs = 100000
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', 'html' }}).TotalMilliseconds
$lazyMillis = (Measure-Command { foreach( $i in 1..$runs ) { $text -replace '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', 'html' }}).TotalMilliseconds
[PSCustomObject]@{
    RegExExclude = '{0} ms' -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}
Output from PS 7.2:
RegExExclude RegExLazy
------------ ---------
281 ms       350 ms (125%)
The difference is noticeable, but not that big, so you may go for readability if performance doesn't matter.
The performance difference between the two becomes even smaller when using a compiled RegEx:
$text = '<div style="padding-right: 10px"><div ><table id="layoutsTable" style="width:100%"><tbody><tr style="vertical-align:top"><td style="width:100%"><div style="width:100%"><div role="textbox" aria-haspopup="true" aria-autocomplete="both" aria-multiline="true"><a id="0::Home|Home" href="/sites/Team/Project/SitePages/Home.aspx">Home</a> - <a id="1::Jenkins|Jenkins" href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a><h1 >Jenkins Integration with Deployment Tools</h1>'
$runs = 100000
$rxExclude = [regex]::new( '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$rxLazy = [regex]::new( '(?<=/sites/.*?/.*?/SitePages/.*?)aspx', [Text.RegularExpressions.RegexOptions]::Compiled )
$excludeMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxExclude.Replace( $text, 'html' ) }}).TotalMilliseconds
$lazyMillis = (Measure-Command { foreach( $i in 1..$runs ) { $rxLazy.Replace( $text, 'html' ) }}).TotalMilliseconds
[PSCustomObject]@{
    RegExExclude = '{0} ms' -f [int]$excludeMillis
    RegExLazy    = '{0} ms ({1}%)' -f [int]$lazyMillis, [int]($lazyMillis / $excludeMillis * 100)
}
Output from PS 7.2:
RegExExclude RegExLazy
------------ ---------
160 ms       178 ms (111%)
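Outside a benchmark, a compiled regex mostly pays off when the same object is reused across many lines. A minimal sketch of that usage, assuming the lines are already in memory (the $lines array is a stand-in for real page content, for example the output of Get-Content):

```powershell
# Build the regex once and reuse it for every line.
# Note: unlike -replace, [regex] is case-sensitive unless IgnoreCase is added.
$rx = [regex]::new(
    '(?<=/sites/[^/]*/[^/]*/SitePages/[^/]*)aspx',
    [Text.RegularExpressions.RegexOptions]::Compiled
)

# Hypothetical input lines standing in for the content of a page.
$lines = @(
    '<a href="/sites/Team/Project/SitePages/Home.aspx">Home</a>'
    '<a href="/sites/Team/Project/SitePages/Jenkins.aspx">Jenkins</a>'
    '<p>No links here</p>'
)

# Lines without a match are returned unchanged.
$converted = foreach ($line in $lines) { $rx.Replace($line, 'html') }
$converted
```

For a one-off conversion of a handful of static pages, the plain -replace operator is likely simpler and fast enough.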