PowerShell - escaping fancy single and double quotes for regex and string replace

Time:12-03

I'm working with HTML files created by Acrobat, which doesn't use proper HTML entities to escape Unicode characters. I need to include single and double curly quotation marks in a regex pattern, but every attempt I've made at escaping these characters has failed in my script, even though the same code works from a regular PowerShell session.

For example, this find/replace does not work:

    $html = $html.Replace("`“", '&ldquo;')
    $html = $html.Replace("`”", '&rdquo;')
    $html = $html.Replace("`‘", '&lsquo;')
    $html = $html.Replace("`’", '&rsquo;')

...but it does work if I break into my script and run one of those replace lines from the debug prompt.

Edit: Here's a snippet of the markup I'm testing with right now:

<p style="padding-left: 5pt;text-indent: 17pt;line-height: 119%;text-align: justify;">To guide its readers the Hermetica makes use of the mystical astrological world-view that we have been discussing. It describes the creation of the world as a series of emanations, starting with the Light, who gave birth to a son called Logos. In the words of Hermes’s guide, Poimandres:</p><p style="padding-left: 24pt;text-indent: 0pt;line-height: 119%;text-align: justify;">“That Light,” he said, “is I, even Mind, the first God, who was before the watery substance which appeared out of the darkness; and the Logos which came forth the Light is son of God.”</p><p style="padding-left: 21pt;text-indent: 1pt;line-height: 119%;text-align: justify;">(Scott, Walter, translator, Hermetica: The Ancient Greek and Latin Writings Which Contain Religious or Philosophical Teachings Ascribed to Hermes Trismegistus, Boston: Shambhala: 1985, p. 117)</p>

If $html equals that string, my attempts to find and replace the characters appear to be futile.

CodePudding user response:

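Try using the Unicode code points instead of backtick-escaping the literal characters (the `` `u{...} `` escape requires PowerShell 6 or later; in Windows PowerShell 5.1, cast the code point instead, e.g. `[string][char]0x201C`):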

    $html = $html.Replace("`u{201C}", '&ldquo;')
    $html = $html.Replace("`u{201D}", '&rdquo;')
    $html = $html.Replace("`u{2018}", '&lsquo;')
    $html = $html.Replace("`u{2019}", '&rsquo;')

If you're having problems with encoding (UTF-8, for example, as you suggested), take a look at https://unicode-table.com - you can get the code values for any encoding.
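
As a minimal sketch of the same idea that also runs on Windows PowerShell 5.1, the quotes can be built from their code points with `[char]` casts. The script file then contains only ASCII, so its encoding should no longer matter. The function name `ConvertTo-QuoteEntity` is made up for this illustration and is not a built-in cmdlet:

```powershell
# Hypothetical helper (name invented for this example).
function ConvertTo-QuoteEntity {
    param([string]$Text)

    # Code point / entity pairs. Casting the number to [char] avoids
    # putting any non-ASCII literal in the script file.
    $pairs = @(
        @(0x201C, '&ldquo;'),   # left double quotation mark
        @(0x201D, '&rdquo;'),   # right double quotation mark
        @(0x2018, '&lsquo;'),   # left single quotation mark
        @(0x2019, '&rsquo;')    # right single quotation mark
    )

    foreach ($pair in $pairs) {
        # String.Replace needs two strings, hence the [string] cast.
        $Text = $Text.Replace([string][char]$pair[0], $pair[1])
    }
    return $Text
}

# Usage: build a sample with curly double quotes, again from code points.
$sample = [string][char]0x201C + 'That Light,' + [string][char]0x201D + ' he said'
ConvertTo-QuoteEntity -Text $sample   # &ldquo;That Light,&rdquo; he said
```

Keeping the pairs in one list means a further character (an en dash, say) would be one more line rather than another copy of the `Replace` call.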

CodePudding user response:

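Evidently, the problem is the encoding of the script file itself. Windows PowerShell reads a UTF-8 script that has no BOM using the system's legacy ANSI code page, so the curly-quote literals in the script are decoded as different characters and never match the text in `$html`. That is also why the same line works when typed at the debug prompt. Setting VS Code to save PowerShell scripts as UTF-8 with BOM (for example, `"files.encoding": "utf8bom"` in the `[powershell]` language settings) allows `String.Replace` to operate as expected.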
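
If changing the editor setting is not an option, a script can be checked and re-saved from PowerShell itself. This is a rough sketch, not a tested tool: the function names `Test-Utf8Bom` and `Add-Utf8Bom` are invented for this example, and `Add-Utf8Bom` assumes the file is currently valid BOM-less UTF-8.

```powershell
# Hypothetical helpers (names invented for this example).
function Test-Utf8Bom {
    param([string]$Path)

    # .NET file APIs resolve relative paths against the process directory,
    # not the PowerShell location, so resolve to a full path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    # A UTF-8 BOM is the three bytes EF BB BF at the start of the file.
    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and
            $bytes[1] -eq 0xBB -and
            $bytes[2] -eq 0xBF)
}

function Add-Utf8Bom {
    param([string]$Path)

    if (Test-Utf8Bom -Path $Path) { return }   # nothing to do

    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    # Assumption: the existing content is UTF-8 without a BOM.
    $noBom   = [System.Text.UTF8Encoding]::new($false)
    $withBom = [System.Text.UTF8Encoding]::new($true)
    $text = [System.IO.File]::ReadAllText($fullPath, $noBom)
    [System.IO.File]::WriteAllText($fullPath, $text, $withBom)
}
```

Running `Test-Utf8Bom -Path .\myscript.ps1` (the file name is a placeholder) before and after `Add-Utf8Bom` shows whether the rewrite took effect.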
