I have a text file input.xlf
&lt;trans-unit id="loco:5e7257a0c38e0f5b456bae94"&gt;
&lt;source&gt;Login&lt;/source&gt;
&lt;target&gt;登入&lt;/target&gt;
&lt;note&gt;Login Header&lt;/note&gt;
&lt;/trans-unit&gt;
Basically, I need to replace &lt; with '<' and &gt; with '>', so I run the script below:
runner.bat
powershell -Command "(gc input.xlf) -replace '&lt;', '<' | Out-File -encoding ASCII output.xlf";
powershell -Command "(gc output.xlf) -replace '&gt;', '>' | Out-File -encoding ASCII output.xlf";
The above was working until I noticed the output below:
<trans-unit id="loco:5e7257a0c38e0f5b456bae94">
<source>Login</source>
<target>??????</target>
<note>Login Header</note>
</trans-unit>
I tried removing the encoding but now I get
<trans-unit id="loco:5e7257a0c38e0f5b456bae94">
<source>Login</source>
<target>登入</target>
<note>Login Header</note>
</trans-unit>
Below is my desired output
<trans-unit id="loco:5e7257a0c38e0f5b456bae94">
<source>Login</source>
<target>登入</target>
<note>Login Header</note>
</trans-unit>
CodePudding user response:
There are (potentially) two character-encoding problems:
- On output, using -Encoding Ascii is guaranteed to "lossily" transliterate any non-ASCII-range characters to literal ? characters.
  - To preserve all characters, you must choose a Unicode encoding, such as -Encoding Utf8.
- On input, you must ensure that the input file is correctly read by PowerShell.
  - Specifically, Windows PowerShell misinterprets BOM-less UTF-8 files as ANSI-encoded, so you need to use -Encoding Utf8 with Get-Content too.
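The two problems can be reproduced byte by byte. The following is only an illustrative sketch, written in Python rather than PowerShell to make each step explicit, and it assumes the system's ANSI code page is Windows-1252 (cp1252); other code pages would give different intermediate characters. It suggests why two Chinese characters can end up as six question marks: they occupy six bytes in UTF-8, each byte is misread as one ANSI character, and each of those is then replaced by ? on ASCII output.

```python
# Reproduce the two encoding mishaps step by step.
# Assumption: the ANSI code page is Windows-1252 (cp1252).
original = "登入"

# Bytes as stored in a BOM-less UTF-8 file (3 bytes per character here).
utf8_bytes = original.encode("utf-8")

# Input problem: the UTF-8 bytes are decoded as ANSI, one character per byte.
misread = utf8_bytes.decode("cp1252")

# Output problem: ASCII cannot represent these characters, so each becomes "?".
ascii_bytes = misread.encode("ascii", errors="replace")

# Fix: use UTF-8 consistently for both reading and writing.
round_trip = utf8_bytes.decode("utf-8").encode("utf-8").decode("utf-8")

print(len(utf8_bytes))         # 6
print(ascii_bytes)             # b'??????'
print(round_trip == original)  # True
```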
Additionally, you can get away with a single powershell.exe call, which can be optimized further:
powershell -Command "(gc -Raw -Encoding utf8 input.xlf) -replace '&lt;', '<' -replace '&gt;', '>' | Set-Content -NoNewLine -Encoding Utf8 output.xlf"
- Using -Raw with gc (Get-Content) reads the file as a whole instead of into an array of lines, which speeds up the -replace operations.
- You can chain -replace operations.
- With input that is already text (strings), Set-Content is generally the faster choice.[1]
- -NoNewLine prevents an extra trailing newline from getting appended.
[1] It will make virtually no difference here, given that only a single string is written, but with many input strings (line-by-line output) it may - see this answer.