Replace all characters in a string that are outside the Windows-1252 set

Having to maintain old programs written in VB6, I find myself having this issue.

I need an efficient way to search a string for all characters outside the Windows-1252 set and replace them with "_". I can do this in C#.

So far I have done this by building a string that contains every Windows-1252 character. Is there a faster way?

I may have to do this for a few million records in a text file.

string chars1252 = "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿŸžœ›š™˜—–•\"’’ŽŽ‹Š‰vˆ‡†…„ƒ‚€ ";

//Replace all characters not in the string above...

CodePudding user response：

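Have you tried normalizing the string first? Note that string.Normalize() does not remove anything: it only converts the string to a Unicode normalization form. It is still a useful first step, because FormC combines a base letter and a following combining mark into a single precomposed character where one exists (for example, "e" followed by U+0301 becomes "é", which is in Windows-1252), so fewer characters end up being replaced afterwards. https://learn.microsoft.com/de-de/dotnet/api/system.string.normalize?view=net-7.0

string inputString = "Some input string";
// FormC composes; FormD would split "é" into "e" plus a combining accent that is not in Windows-1252.
string outputString = inputString.Normalize(NormalizationForm.FormC);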

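To do the replacement itself, you can loop over the string, check each character, and append "_" to a StringBuilder for every character that is not in the Windows-1252 set. Comparing against '\u00FF' alone is not enough: characters such as "€" or "Œ" are in Windows-1252 but have Unicode code points above U+00FF, while U+0080 to U+009F are control characters rather than printable Windows-1252 characters.

// requires: using System.Text;
string inputString = "Some input string";
// Windows-1252 characters above U+00FF (bytes 0x80-0x9F of the code page).
const string extra1252 = "€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ";
StringBuilder sb = new StringBuilder(inputString.Length);
foreach (char c in inputString)
{
    // ASCII, the printable Latin-1 range, or one of the 27 extra characters.
    bool in1252 = c < '\u0080' || (c >= '\u00A0' && c <= '\u00FF') || extra1252.IndexOf(c) >= 0;
    sb.Append(in1252 ? c : '_');
}
string outputString = sb.ToString();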
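
Since the question mentions a few million records in a text file, here is a rough sketch of how the per-character test could be turned into a precomputed lookup table and applied line by line. The names Cp1252Filter, Sanitize and SanitizeFile are made up for illustration. The sketch assumes the input file is UTF-8 (the default for File.ReadLines) and treats U+0080 to U+009F as outside the set; adjust both if your data differs.

```csharp
using System;
using System.IO;

public static class Cp1252Filter
{
    // Windows-1252 characters that sit above U+00FF in Unicode
    // (bytes 0x80-0x9F of the code page).
    private const string Extra = "€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ";

    // One flag per UTF-16 code unit, built once, so the per-character
    // test is a single array read.
    private static readonly bool[] Allowed = BuildTable();

    private static bool[] BuildTable()
    {
        var table = new bool[char.MaxValue + 1];
        for (int c = 0x00; c <= 0x7F; c++) table[c] = true; // ASCII
        for (int c = 0xA0; c <= 0xFF; c++) table[c] = true; // printable Latin-1 range
        foreach (char c in Extra) table[c] = true;          // the 27 extra characters
        return table;
    }

    // Returns a copy of the line with every disallowed character replaced by '_'.
    public static string Sanitize(string line)
    {
        char[] buffer = line.ToCharArray();
        for (int i = 0; i < buffer.Length; i++)
        {
            if (!Allowed[buffer[i]]) buffer[i] = '_';
        }
        return new string(buffer);
    }

    // Streams the input so the whole file never has to fit in memory.
    public static void SanitizeFile(string inputPath, string outputPath)
    {
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                writer.WriteLine(Sanitize(line));
            }
        }
    }
}
```

Calling Cp1252Filter.SanitizeFile("in.txt", "out.txt") (hypothetical paths) processes the file one line at a time. A character outside the Basic Multilingual Plane is stored as two UTF-16 code units, so it becomes two underscores with this approach.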

CodePudding user response：

The Encoding class can achieve this, most likely very efficiently. When converting to and from the encoding, a replacement character can be specified.
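
A minimal sketch of this idea as a reusable helper: the text is encoded to Windows-1252 bytes and decoded again, with "_" configured as the replacement for anything that cannot be mapped. The class and method names (Cp1252Sanitizer, Sanitize) are made up for illustration, and the RegisterProvider call assumes .NET Core or .NET 5+, where code page 1252 is not available without the code pages provider.

```csharp
using System;
using System.Text;

public static class Cp1252Sanitizer
{
    // Built once and reused; looking up the encoding on every call would be wasteful.
    private static readonly Encoding Cp1252 = CreateEncoding();

    private static Encoding CreateEncoding()
    {
        // Assumption: running on .NET Core / .NET 5+, where this provider
        // must be registered before code page 1252 can be requested.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return Encoding.GetEncoding(
            1252,
            new EncoderReplacementFallback("_"),  // used when a character has no 1252 byte
            new DecoderReplacementFallback("_")); // used when a byte cannot be decoded
    }

    // Round-trips the text through Windows-1252 so unmappable characters become "_".
    public static string Sanitize(string text)
    {
        byte[] bytes = Cp1252.GetBytes(text);
        return Cp1252.GetString(bytes);
    }
}
```

With this helper, Cp1252Sanitizer.Sanitize("abc絵de") returns "abc_de". Supplying an explicit replacement fallback also means unmappable characters are replaced rather than silently mapped to a similar-looking character.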

using System;
using System.Text;
                    
public class Program
{
    public static void Main()
    {
        // For .NET core only:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var text = "abc絵de