Only Convert Valid Bytes with .NET GetBytes Method without creating question marks

Time:12-13

I am converting strings that contain weird symbols I don't want to Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this is only about the .NET methods:

    $bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
    $String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)

This works pretty well, except that the weird symbols don't get removed; question marks are created instead. For example:

"Helloäöü?→"

becomes

"Helloäöü?????"

What I want is to only convert valid bytes, without creating question marks, so the output will be:

"Helloäöü?"

Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...

CodePudding user response:

Use a regex-based -replace operation based on named Unicode blocks:

    "Helloäöü?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]'

Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes.

Instead, the above removes characters that don't fall into the ISO-8859-1 Unicode subrange, which is mostly the same as Windows-1252.
However, there are certain characters in Windows-1252 not present in ISO-8859-1, notably €.
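
To illustrate that gap, here is a minimal sketch; the sample string and the `$pattern` variable name are made up for illustration. The block-based pattern keeps `Ä` and `ü`, which are in the Latin-1 Supplement block, but drops `€`, because U+20AC lies outside both named blocks:

```powershell
# Same pattern as above: match anything outside Basic Latin and Latin-1 Supplement.
$pattern = '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]'

# Illustrative input; -creplace with no replacement operand removes the matches.
'Preis: 5€ für Äpfel' -creplace $pattern
# Output: Preis: 5 für Äpfel
```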


However, if you do need to cover all characters of the Windows-1252 encoding, more work is needed:

  • One solution is to individually add the Windows-1252 characters missing from ISO-8859-1 to the character class ([...]) above:

    • €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ
  • A solution that works with other encodings too is to adapt your to-and-from-bytes encoding approach to remove all characters that can't be represented in the target encoding, using a System.Text.EncoderReplacementFallback instance initialized with the empty string.

    # Note the use of `€`, which should be preserved.
    $string = "Helloäöü>>€<<?→"

    $encoding = [System.Text.Encoding]::GetEncoding(
      1252,
      # Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
      [System.Text.EncoderReplacementFallback]::new(''),
      [System.Text.DecoderFallback]::ExceptionFallback # not relevant here
    )

    $string = $encoding.GetString($encoding.GetBytes($string))
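
For comparison, the first option (extending the character class) could be sketched as follows. The `$extra` and `$pattern` variable names are illustrative. The 27 characters from the list are written as `\uXXXX` regex escapes rather than literally, partly because PowerShell parses some of them (the typographic quotes such as ‘ ’ and ‚) as quote characters inside a string literal, which makes a literal character class easy to break:

```powershell
# Windows-1252 characters (bytes 0x80-0x9F) that ISO-8859-1 lacks, as regex escapes:
# € ‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ
$extra = '\u20AC\u201A\u0192\u201E\u2026\u2020\u2021\u02C6\u2030\u0160\u2039\u0152\u017D' +
         '\u2018\u2019\u201C\u201D\u2022\u2013\u2014\u02DC\u2122\u0161\u203A\u0153\u017E\u0178'

# Negated class: match (and thereby remove) everything outside the allowed set.
$pattern = "[^\p{IsBasicLatin}\p{IsLatin-1Supplement}$extra]"

'Helloäöü>>€<<→' -creplace $pattern
# Output: Helloäöü>>€<<
```

Unlike the fallback-based approach, this pattern is specific to Windows-1252 and would have to be rebuilt by hand for any other target encoding.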