I have a strange problem. On the Windows 7 operating system, I try to run this command in PowerShell:
ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" > test.txt
When I read the test.txt file:
ruby -E UTF-8 -e "puts gets" < test.txt
the result is:
�i0F0^0�0�0W0O0J0X�D0W0~0Y0,Mr Jason
I checked the test.txt file and found that its encoding is Unicode (UTF-16), not UTF-8.
What should I do?
How can I ensure the encoding of the output file after redirection? Please help me.
CodePudding user response:
tl;dr
Unfortunately, the solution (on Windows) is much more complicated than one would hope:
# Make PowerShell both send and receive data as UTF-8 when talking to
# external (native) programs.
# Note:
# * In *PowerShell (Core) 7+*, $OutputEncoding *defaults* to UTF-8.
# * You may want to save and restore the original settings.
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new()
# Create a BOM-less UTF-8 file.
# Note: In *PowerShell (Core) 7+*, you can less obscurely use:
# ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" | Set-Content test.txt
$null = New-Item -Force test.txt -Value (
ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'"
)
# Pipe the resulting file back to Ruby as UTF-8, thanks to $OutputEncoding
# Note that PowerShell has NO "<" operator - stdin input must be provided
# via the pipeline.
Get-Content -Raw test.txt | ruby -E UTF-8 -e "puts gets"
In terms of character encoding, PowerShell communicates with external (native) programs via two settings that contain .NET System.Text.Encoding instances:
- $OutputEncoding specifies the encoding used to send data TO an external program via the pipeline.
- [Console]::OutputEncoding specifies the encoding used to interpret (decode) data FROM an external program('s stdout stream); for decoding to work as intended, this setting must match the external program's actual output encoding.
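For example, here is a minimal sketch of the save-and-restore approach noted in the tl;dr comments (the $prev* variable names are just placeholders):
# Save the current settings, switch both directions to UTF-8, then restore.
$prevOutputEncoding = $OutputEncoding
$prevConsoleEncoding = [Console]::OutputEncoding
try {
  $OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new()
  # ... run external (native) programs that produce / consume UTF-8 here ...
}
finally {
  # Restore whatever the session was using before.
  $OutputEncoding = $prevOutputEncoding
  [Console]::OutputEncoding = $prevConsoleEncoding
}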
As of PowerShell 7.3.1, PowerShell only "speaks text" when communicating with external programs, and an intermediate decoding and re-encoding step is invariably involved - even when you're just using
>
(effectively an alias of the Out-File cmdlet) to send output to a file. That is, PowerShell's pipelines are NOT raw byte conduits the way they are in other shells.
- See this answer for workarounds and potential future raw-byte support.
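To see the consequence of this text-only pipeline concretely, here is a small sketch (some.exe and copy.exe are just example file names): redirecting any binary file's content through PowerShell yields a corrupted copy, because the bytes are decoded into .NET strings and re-encoded on output.
# NOT a faithful copy in PowerShell: the bytes are decoded per
# [Console]::OutputEncoding and re-encoded per the > / Out-File default.
cmd /c "type some.exe" > copy.exe
# Compare the sizes: the copy will not be byte-identical to the original.
(Get-Item some.exe).Length
(Get-Item copy.exe).Length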
Whatever output operator (>) or cmdlet (Out-File, Set-Content) you use will apply its own default character encoding, which is unrelated to the encoding of the original input; the input has already been decoded into .NET strings by the time the operator / cmdlet operates on it.
> / Out-File in Windows PowerShell defaults to "Unicode" (UTF-16LE) encoding, which is what you saw.
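You can confirm this by inspecting the first bytes of the redirected file, which reveal the UTF-16LE byte-order mark; a quick diagnostic sketch, assuming the test.txt from the question (in PowerShell 7, use -AsByteStream instead of -Encoding Byte):
# Windows PowerShell: the first two bytes are 255 254 (0xFF 0xFE),
# i.e. the UTF-16LE BOM written by > / Out-File.
Get-Content -Encoding Byte -TotalCount 2 test.txt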
While Out-File and Set-Content have an -Encoding parameter that allows you to control the output encoding, in Windows PowerShell they don't allow you to create BOM-less UTF-8 files; curiously, New-Item does create such files, which is why it is used above. If a UTF-8 BOM is acceptable, ... | Set-Content -Encoding utf8 will do in Windows PowerShell.
Note that, by contrast, PowerShell (Core) 7+, the modern, cross-platform edition, now thankfully consistently defaults to BOM-less UTF-8.
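To make the Windows PowerShell difference tangible, here is a small sketch contrasting the two approaches (the file names nobom.txt and bom.txt are just examples):
# New-Item -Value creates a BOM-less UTF-8 file ...
$null = New-Item -Force nobom.txt -Value 'どうぞよろしくお願いします'
# ... whereas Set-Content -Encoding utf8 writes a UTF-8 BOM first.
'どうぞよろしくお願いします' | Set-Content -Encoding utf8 bom.txt
# The BOM-full file starts with bytes 239 187 191 (0xEF 0xBB 0xBF).
Get-Content -Encoding Byte -TotalCount 3 bom.txt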
- That said, with respect to [Console]::OutputEncoding on Windows, it still uses the legacy OEM code page by default as of v7.3.1, which means that UTF-8 output from external programs is by default misinterpreted - see GitHub issue #7233 for a discussion.
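One way to deal with this, sketched below, is to put the settings from the tl;dr into your $PROFILE script so that every new session applies them (adapt to your own setup):
# In your $PROFILE: make UTF-8 the default for exchanging data
# with external (native) programs in every session.
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new()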