javac with Powershell Core creates ANSI .class files from UTF8 source-CodePudding

Powershell Core 7 is apparently natively BOM-less UTF8 but it still performs exactly the same as Windows Powershell when I use javac on any UTF8 source file that contains accented characters : it encodes the .class file with ANSI character encodings.

For example, this simple program PremierProg.java :

public class PremierProg
{
    public static void main( String[] args )
    {
        System.out.println("Je suis déterminé à apprendre comment coder.");
    }
}

will be compiled then executed with the following output in pwsh :

Je suis dÃ©terminÃ© Ã apprendre comment coder.

I can very obviously add the -encoding "UTF-8" option to my javac call, but isn't the point of cross-platform not having to do any of that? It is actually easier to type wsl javac [MySource.java] and have it output the correct .class file. The same versions of openjdk are installed on both the Windows and Ubuntu sides.

Powershell does read the file correctly as UTF8 :

pwsh reads utf-8

but still interacts with javac using ANSI (even though other forever-utf8-native shells like bash don't have this issue).

Does anyone know why Powershell - even the cross-platform Core version - behaves this way? I really don't want to add anything to a profile.ps1 file or to the javac call. I would just like something that is UTF8-native to fully behave that way.

At the moment, I am getting my students to run bash (via wsl) as their default shell, and that's fine, but this problem still bothers me. I'd like to understand what is going on, and if the solution is at all reasonable, fix it.

The reason for not wanting a profile.ps1 file or extra parameters in the javac call is because this needs to run on school-board controlled devices where scripts are disabled and I am dealing with a majority of first-time programmers.

CodePudding user response：

Thanks to @Slaw 's comments to the original question, the solution actually has nothing to do with PowerShell or any other console but with the platform (Windows, MacOS, Linux) and the JDK.

Java 18 will actually no longer default to the platform charset but to UTF-8, so that will ultimately eliminate this issue.

In the meantime, the best solution seems to be adding the -encoding UTF-8 option to the javac call. Students can save time by using the arrows to retrieve the longer command from their history rather than typing it out each time they need to compile. This solution will still be useful after Java 18 is released simply because it is clear and explicit at the slight cost of being longer.

CodePudding user response：

To complement your own answer:

Windows 10 offers a - still-in-beta - option to use UTF-8 system-wide (meaning that both the OEM and the ANSI code page are set to 65001, which is UTF-8). While activating this option has the potential to make encoding headaches go away - not just with javac (the active ANSI code page it uses will then effectively be UTF-8), but also with PowerShell in general (see below) - it also has far-reaching consequences - see this answer.
If activating system-wide UTF-8 support is not an option, you could work around the problem by defining a wrapper function for javac that hard-codes -encoding utf8 while passing all other arguments through, and place that in your $PROFILE file, so that it becomes available by default in all future sessions:

function javac { javac.exe -encoding utf8 $args }

^{Note: Functions have higher command precedence than external programs, so when you submit javac, the function is called. If you also wanted to define javac.exe as a wrapper, you could add Set-Alias javac.exe javac, and redefined the function body as & (Get-Command -Type Application javac.exe) -encoding utf8 $args}

Also note that there are PowerShell-specific character-encoding considerations:

As of PowerShell (Core) 7.2, PowerShell console windows unfortunately still default to the system's legacy OEM code page, as reflected in the output from chcp, and, in .NET, [Console]::InputEncoding and [Console]::OutputEncoding.
- If a given external program outputs text in a different encoding - e.g. UTF-8, [Console]::Encoding must first be set to that encoding in order for PowerShell to decode the output correctly.
- Caveat: An encoding mismatch may go unnoticed in direct-to-display output, but will surface when PowerShell processes the output, such as sending it through the pipeline, redirecting it to a file, or capturing it in a variable.
Conversely, the $OutputEncoding preference variable determines what encoding PowerShell uses to send data to external programs, via the pipeline.
- Windows PowerShell regrettably defaults to ASCII(!), with any non-ASCII-range characters getting transcoded "lossily" to literal ? chars.
- PowerShell (Core) 7 now more sensibly defaults to UTF-8 - although, as stated, on de-coding output it still defaults to the system's OEM code page.

See this answer for a more detailed discussion of PowerShell's encoding behavior and links to helper functions.