Why does redirecting Ruby IO output in Windows PowerShell change the output file's encoding?

Time:01-04

I have a strange problem. On the Windows 7 operating system, I try to run this command in PowerShell:

ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" > test.txt

When I read the test.txt file back:

ruby -E UTF-8 -e "puts gets" < test.txt

the result is:

�i0F0^0�0�0W0O0J0X�D0W0~0Y0,Mr Jason

When I check the test.txt file, I find that its encoding is Unicode, not UTF-8.

What should I do?

How can I control the encoding of the output file after redirection? Please help me.

CodePudding user response:

tl;dr

Unfortunately, the solution (on Windows) is much more complicated than one would hope:

# Make PowerShell both send and receive data as UTF-8 when talking to
# external (native) programs.
# Note: 
#  * In PowerShell (Core) 7+, $OutputEncoding *defaults* to UTF-8.
#  * You may want to save and restore the original settings.
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::new()
 
# Create a BOM-less UTF-8 file.
# Note: In PowerShell (Core) 7+, you can less obscurely use:
#   ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'" | Set-Content test.txt
$null = New-Item -Force test.txt -Value (
  ruby -E UTF-8 -e "puts 'どうぞよろしくお願いします,Mr Jason'"
)

# Pipe the resulting file back to Ruby as UTF-8, thanks to $OutputEncoding
# Note that PowerShell has NO "<" operator - stdin input must be provided
# via the pipeline.
Get-Content -Raw test.txt | ruby -E UTF-8 -e "puts gets"
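
Alternatively, if changing PowerShell's encoding settings is undesirable, Ruby itself can decode the UTF-16LE ("Unicode") file that Windows PowerShell's > operator produced. A minimal, self-contained sketch (the file name and the write step merely simulate what the redirection did):

```ruby
# Simulate what Windows PowerShell's > operator did: write the text
# as UTF-16LE ("Unicode") with a leading BOM.
File.binwrite('test.txt',
              "\uFEFFどうぞよろしくお願いします,Mr Jason".encode('UTF-16LE'))

# Ruby can decode such a file directly: the 'BOM|UTF-16LE' external
# encoding honors (and strips) the BOM, and ':UTF-8' transcodes the
# data to UTF-8 internally.
text = File.read('test.txt', mode: 'rb:BOM|UTF-16LE:UTF-8')
puts text  # prints the original string, BOM stripped
```

This sidesteps PowerShell's re-encoding entirely, at the cost of hard-coding the expected file encoding in the Ruby script.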

  • In terms of character encoding, PowerShell communicates with external (native) programs via two settings that contain .NET System.Text.Encoding instances:

    • $OutputEncoding specifies the encoding to use to send data TO an external program via the pipeline.

    • [Console]::OutputEncoding specifies the encoding used to decode data received FROM an external program('s stdout stream); for decoding to work as intended, this setting must match the external program's actual output encoding.

  • As of PowerShell 7.3.1, PowerShell only "speaks text" when communicating with external programs, and an intermediate decoding and re-encoding step is invariably involved - even when you're just using > (effectively an alias of the Out-File cmdlet) to send output to a file.

    • That is, PowerShell's pipelines are NOT raw byte conduits the way they are in other shells.

      • See this answer for workarounds and potential future raw-byte support.
    • Whatever output operator (>) or cmdlet (Out-File, Set-Content) you use will apply its own default character encoding, which is unrelated to the encoding of the original input; by the time the operator / cmdlet operates on the data, it has already been decoded into .NET strings.

      • > / Out-File in Windows PowerShell defaults to "Unicode" (UTF-16LE) encoding, which is what you saw.

      • While Out-File and Set-Content have an -Encoding parameter that allows you to control the output encoding, in Windows PowerShell they don't allow you to create BOM-less UTF-8 files; curiously, New-Item does create such files, which is why it is used above; if a UTF-8 BOM is acceptable, ... | Set-Content -Encoding utf8 will do in Windows PowerShell.

      • Note that, by contrast, PowerShell (Core) 7+, the modern, cross-platform edition, now thankfully defaults to BOM-less UTF-8 consistently.

        • That said, with respect to [Console]::OutputEncoding on Windows, it still uses the legacy OEM code page by default as of v7.3.1, which means that UTF-8 output from external programs is by default misinterpreted - see GitHub issue #7233 for a discussion.
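
The specific garbage in the question can even be reproduced in Ruby: when UTF-16LE bytes are misinterpreted as single-byte text, each Hiragana character's low byte appears followed by a "0" (the 0x30 high byte shared by the Hiragana block). A small sketch of this effect:

```ruby
# "ど" is U+3069; its UTF-16LE bytes are 0x69 0x30, which read as
# single-byte text is "i0". Likewise "う" (U+3046) becomes "F0" and
# "ぞ" (U+305E) becomes "^0" - exactly the "i0F0^0..." pattern shown
# in the question's output.
utf16_bytes = 'どうぞ'.encode('UTF-16LE').bytes
mangled = utf16_bytes.map(&:chr).join
puts mangled
```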