I am converting strings that contain unwanted symbols to Latin-1 (or at least, what Microsoft made of it) and back into a string. I use PowerShell, but this question is only about the .NET methods:
$bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($String)
$String = [System.Text.Encoding]::GetEncoding(1252).GetString($bytes)
This works reasonably well, except that the unwanted symbols are not removed; they are replaced with question marks instead. For example:
"Helloäöü?→"
becomes
"Helloäöü?????"
What I want is to only convert valid bytes, without creating question marks, so the output will be:
"Helloäöü?"
Is that possible? I searched a bit already, but couldn't find anything. ChatGPT lies to me and says there would be a "GetValidBytes" method, but there isn't...
CodePudding user response:
Use a regex-based -replace operation based on named Unicode blocks:
"Helloäöü?→" -creplace '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}]'
Given that your input already is a .NET string (and therefore composed of UTF-16 code units), there's no strict need for conversion to and from bytes.
Instead, the above removes characters that don't fall into the ISO-8859-1 Unicode subrange, which is mostly the same as Windows-1252.
However, there are certain characters in Windows-1252 not present in ISO-8859-1, notably €.
If you do need to cover all characters of the Windows-1252 encoding, more work is needed:
One solution is to individually add the Windows-1252 characters missing from ISO-8859-1 to the character class ([...]) above: €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ
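As a minimal sketch of that first option, the extended character class could be built as follows. The variable names are hypothetical, and the extra characters are written as \uXXXX regex escapes rather than literally, on the assumption that this avoids script-file encoding problems and PowerShell's habit of parsing typographic quotes such as ‘ and “ as string delimiters:

```powershell
# The 27 characters Windows-1252 defines in the 0x80-0x9F byte range,
# written as regex \uXXXX escapes (hypothetical variable name).
$cp1252Extras = '\u20AC\u201A\u0192\u201E\u2026\u2020\u2021\u02C6\u2030' +
                '\u0160\u2039\u0152\u017D\u2018\u2019\u201C\u201D\u2022' +
                '\u2013\u2014\u02DC\u2122\u0161\u203A\u0153\u017E\u0178'

# Negated character class: match anything that is NOT Basic Latin,
# Latin-1 Supplement, or one of the extra Windows-1252 characters.
$pattern = '[^\p{IsBasicLatin}\p{IsLatin-1Supplement}' + $cp1252Extras + ']'

# Sample input: "Hello", ä (U+00E4), € (U+20AC), → (U+2192).
$sample = 'Hello' + [char]0xE4 + [char]0x20AC + [char]0x2192

# With no replacement operand, -creplace removes every match.
$cleaned = $sample -creplace $pattern
```

Here the right arrow is removed, while ä and € are kept.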
A solution that works with other encodings too is to adapt your to-and-from-bytes encoding approach to remove all characters that can't be represented in the target encoding, using a System.Text.EncoderReplacementFallback instance initialized with the empty string.
# Note the use of `€`, which should be preserved.
$string = "Helloäöü>>€<<?→"
$encoding = [System.Text.Encoding]::GetEncoding(
    1252,
    # Replace non-Windows-1252 chars. with '' (empty string), i.e. *remove* them.
    [System.Text.EncoderReplacementFallback]::new(''),
    [System.Text.DecoderFallback]::ExceptionFallback # not relevant here
)
$string = $encoding.GetString($encoding.GetBytes($string))
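Since this approach is not tied to Windows-1252, it could be wrapped in a small helper that takes the code page as a parameter. The following is only a sketch: the function name Remove-UnencodableChar and its parameters are made up for illustration, and it assumes the requested code page is available in the running .NET runtime.

```powershell
# Hypothetical helper: removes every character that the given code page
# cannot represent, by round-tripping the text through bytes.
function Remove-UnencodableChar {
    param(
        [Parameter(Mandatory)] [AllowEmptyString()] [string] $Text,
        [int] $CodePage = 1252
    )
    $encoding = [System.Text.Encoding]::GetEncoding(
        $CodePage,
        # Unencodable characters are replaced with '' (i.e. dropped).
        [System.Text.EncoderReplacementFallback]::new(''),
        [System.Text.DecoderFallback]::ExceptionFallback
    )
    $encoding.GetString($encoding.GetBytes($Text))
}

# Sample input: "Hello", € (U+20AC), → (U+2192).
$sample = 'Hello' + [char]0x20AC + [char]0x2192

$asCp1252 = Remove-UnencodableChar -Text $sample                  # Windows-1252
$asLatin1 = Remove-UnencodableChar -Text $sample -CodePage 28591  # ISO-8859-1
```

With Windows-1252 the € is kept and only the arrow is dropped; with code page 28591 (ISO-8859-1) both are dropped, which matches the difference between the two encodings described above.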