Python standard IO under Windows PowerShell and CMD

Time:11-14

I have the following two-line Python (v. 3.10.7) program "stdin.py":

    import sys
    print(sys.stdin.read())

and the following one-line text file "ansi.txt" (CP1252 encoding) containing:

    ‘I am well’ he said. 

Note that the open and close quotes are 0x91 and 0x92, respectively. In Windows-10 cmd mode the behavior of the Python code is as expected:

    python stdin.py < ansi.txt  # --> ‘I am well’ he said.

In Windows PowerShell, on the other hand:

    cat .\ansi.txt | python .\stdin.py  # --> ?I am well? he said.

Apparently the CP1252 characters are seen as non-printable characters in the combination Python/PowerShell. If I replace in "stdin.py" the standard input by file input, Python prints correctly the CP1252 quote characters to screen. PowerShell by itself recognizes and prints correctly 0x91 and 0x92.

Questions: can somebody explain to me why cmd works differently than PowerShell in combination with Python? Why doesn't Python recognize the CP1252 quote characters 0x91 and 0x92 when they are piped into it by PowerShell?

CodePudding user response:

tl;dr

Use the $OutputEncoding preference variable:

  • In Windows PowerShell:

    # Using the system's legacy ANSI code page, as Python does by default.
    # NOTE: The & { ... } enclosure isn't strictly necessary, but
    #       ensures that the $OutputEncoding change is only temporary,
    #       by limiting it to the child scope that the enclosure creates.
    & {
      $OutputEncoding = [System.Text.Encoding]::Default
      "‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
    }

    # Using UTF-8 instead, which is generally preferable.
    # Note the `-X utf8` option (Python 3.7+).
    & {
      $OutputEncoding = [System.Text.UTF8Encoding]::new()
      "‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'
    }

  • In PowerShell (Core) 7+:

    # Using the system's legacy ANSI code page, as Python does by default.
    # Note: In PowerShell (Core) / .NET 5+,
    #       `[System.Text.Encoding]::Default` now reports UTF-8,
    #       not the active ANSI encoding.
    & {
      $OutputEncoding = [System.Text.Encoding]::GetEncoding([cultureinfo]::CurrentCulture.TextInfo.ANSICodePage)
      "‘I am well’ he said." | python -c 'import sys; print(sys.stdin.read())'
    }

    # Using UTF-8 instead, which is generally preferable.
    # Note the `-X utf8` option (Python 3.7+).
    # NO need to set $OutputEncoding, as it now *defaults* to UTF-8.
    "‘I am well’ he said." | python -X utf8 -c 'import sys; print(sys.stdin.read())'

Note:

  • $OutputEncoding controls what encoding is used to send data TO external programs via the pipeline (to stdin). It defaults to ASCII(!) in Windows PowerShell, and UTF-8 in PowerShell (Core).

  • [Console]::OutputEncoding controls how data received FROM external programs (via stdout) is decoded. It defaults to the console's active code page, which in turn defaults to the system's legacy OEM code page (such as 437 on US-English systems).

That these two encodings are not aligned by default is unfortunate; while Windows PowerShell will see no more changes, there is hope for PowerShell (Core): it would make sense to have it default consistently to UTF-8:

  • GitHub issue #7233 suggests at least defaulting the shortcut files that launch PowerShell to UTF-8 (code page 65001); GitHub issue #14945 more generally discusses the problematic mismatch.

  • In Windows 10 and above, there is an option to switch to UTF-8 system-wide, which then makes both the OEM and ANSI code pages default to UTF-8 (65001); however, this has far-reaching consequences and is still labeled as being in beta as of Windows 11 - see this answer.


Background information:

It is the $OutputEncoding preference variable that determines what character encoding PowerShell uses to send data (invariably text, as of PowerShell 7.3) to an external program via the pipeline.

  • Note that this even applies when data is read from a file: PowerShell, as of v7.3, never sends raw bytes through the pipeline: it reads the content into .NET strings first and then re-encodes them based on $OutputEncoding on sending them through the pipeline to an external program.

  • Therefore, what encoding your ansi.txt input file uses is ultimately irrelevant, as long as PowerShell decodes it correctly when reading it into .NET strings (which are internally composed of UTF-16 code units).

  • See this answer for more information.
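The decode-then-re-encode behavior described above can be modeled in a few lines of Python. This is only a sketch: the function `simulate_pipe` and its parameter names are made up for illustration, and Python codecs stand in for the .NET encodings that PowerShell actually uses.

```python
def simulate_pipe(file_bytes, file_encoding, output_encoding, stdin_encoding):
    """Model: file -> .NET string -> bytes per $OutputEncoding -> Python's stdin."""
    text = file_bytes.decode(file_encoding)   # PowerShell reads the file into a string
    sent = text.encode(output_encoding)       # re-encoded according to $OutputEncoding
    return sent.decode(stdin_encoding)        # what sys.stdin.read() would return


# The same text stored in two differently encoded files.
ansi_txt = b"\x91I am well\x92 he said."                   # CP1252, as in the question
utf8_txt = "\u2018I am well\u2019 he said.".encode("utf-8")

# Sender and receiver agree on CP1252: the file's own encoding does not matter.
from_ansi = simulate_pipe(ansi_txt, "cp1252", "cp1252", "cp1252")
from_utf8 = simulate_pipe(utf8_txt, "utf-8", "cp1252", "cp1252")

# Sender uses UTF-8 but the receiver decodes as CP1252: the quotes are garbled.
mismatched = simulate_pipe(ansi_txt, "cp1252", "utf-8", "cp1252")

print(ascii(from_ansi))
print(ascii(from_utf8))
print(ascii(mismatched))
```

The first two results are identical, which is the point of the bullets above: only the encoding used for sending and the encoding the receiver assumes have to agree.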

Thus, the character encoding stored in $OutputEncoding must match the encoding that the target program expects.
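To find out what the target program expects in the Python case, you can ask Python itself. The following is a minimal sketch; the helper name `describe_streams` is hypothetical, and the values it reports depend on your system, on whether the streams are redirected, and on options such as `-X utf8`.

```python
import sys


def describe_streams():
    """Return the encodings Python currently associates with stdin and stdout."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


print(describe_streams())
```

Saved to a file and run at the receiving end of a pipeline, this shows the encoding Python will use to decode what PowerShell sends.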

By default, the encoding in $OutputEncoding is unrelated to the encoding implied by the console's active code page (which itself defaults to the system's legacy OEM code page, such as 437 on US-English systems). Legacy console applications, at least, tend to use the console's code page. Python, however, does not: it uses the legacy ANSI code page. Other modern CLIs, notably NodeJS' node.exe, always use UTF-8.
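To see why the OEM vs. ANSI distinction matters, here is a small illustration using the two quote bytes from the question. It assumes a US-English system, where the ANSI code page is 1252 and the OEM code page is 437.

```python
# The two bytes that represent the quotes in ansi.txt.
quote_bytes = b"\x91\x92"

as_ansi = quote_bytes.decode("cp1252")  # ANSI code page: curly quotes
as_oem = quote_bytes.decode("cp437")    # OEM code page: entirely different letters

# Print the code points rather than the characters themselves.
print([f"U+{ord(c):04X}" for c in as_ansi])
print([f"U+{ord(c):04X}" for c in as_oem])
```

The same bytes yield U+2018 and U+2019 (the curly quotes) under CP1252, but U+00E6 and U+00C6 (æ and Æ) under CP437.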

While $OutputEncoding's default in PowerShell (Core) 7 is now UTF-8, Windows PowerShell's default is, regrettably, ASCII(!), which means that non-ASCII characters get "lossily" transliterated to verbatim ASCII ? characters, which is what you saw.
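The effect can be reproduced in Python alone. In this sketch, Python's `errors="replace"` handler serves as a stand-in for the replacement behavior of .NET's ASCII encoding; that equivalence is an assumption made for illustration.

```python
text = "\u2018I am well\u2019 he said."  # the line from ansi.txt, with curly quotes

# Roughly what Windows PowerShell sends with its default $OutputEncoding (ASCII):
sent = text.encode("ascii", errors="replace")

# What Python then reads; the bytes are pure ASCII, so any ANSI code page gives the same result.
received = sent.decode("cp1252")

print(received)  # ?I am well? he said.
```

The information is already lost on the sending side, so no setting on the Python side can recover the quotes.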

Therefore, you must (temporarily) set $OutputEncoding to the encoding that Python expects and/or ask it to use UTF-8 instead.
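If you control the Python script, another way to ask it to use UTF-8 is to decode the raw stdin bytes yourself instead of relying on the default. This is a sketch rather than part of the approach above: `read_stdin_as` and its `stream` parameter are hypothetical names, and it only helps if PowerShell is actually sending UTF-8 (that is, if $OutputEncoding is set accordingly).

```python
import io
import sys


def read_stdin_as(encoding, stream=None):
    """Decode the raw stdin bytes with an explicit encoding.

    `stream` is a hypothetical hook for testing; by default the binary
    layer of the real stdin (sys.stdin.buffer) is read.
    """
    source = stream if stream is not None else sys.stdin.buffer
    return source.read().decode(encoding)


# Stand-in for the bytes PowerShell sends when $OutputEncoding is UTF-8.
fake_pipe = io.BytesIO("\u2018I am well\u2019 he said.".encode("utf-8"))
decoded = read_stdin_as("utf-8", stream=fake_pipe)
print(ascii(decoded))
```

In a real script you would call `read_stdin_as("utf-8")` without the `stream` argument. Calling `sys.stdin.reconfigure(encoding="utf-8")` (Python 3.7+) before reading is a similar option.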
