Replace all characters in a string that are outside the Windows-1252 set

Time:01-18

Having to maintain old programs written in VB6, I find myself having this issue.

I need to find an efficient way to search a string for all characters OUTSIDE the Windows-1252 set and replace them with "_". I can do this in C#.

So far I have done this by creating a string containing all Windows-1252 characters. Is there a faster way?

I may have to do this for a few million records in a text file.

string chars1252 = "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶•¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿŸžœ›š™˜—–•\"’’ŽŽ‹Š‰vˆ‡†…„ƒ‚€ ";

//Replace all characters not in the string above...

CodePudding user response:

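Have you tried normalizing the string first? Note that string.Normalize() does not remove any characters by itself; it converts the string to a Unicode normalization form. With FormD, accented characters are decomposed into a base letter followed by combining marks, so a later filtering step can drop the marks and keep the base letter. https://learn.microsoft.com/de-de/dotnet/api/system.string.normalize?view=net-7.0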

string inputString = "Some input string";
string outputString = inputString.Normalize(NormalizationForm.FormD);

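Alternatively, you can loop over the string and use a StringBuilder to replace each character that falls outside the allowed range. Note that the check below only tests for U+0000 to U+00FF (Latin-1), which approximates Windows-1252: characters such as € (U+20AC), which Windows-1252 maps into the 0x80-0x9F byte range, are treated as outside the set.

// requires: using System.Text;
string inputString = "Some input string";
StringBuilder sb = new StringBuilder(inputString.Length);
foreach (char c in inputString)
{
    // Keep characters up to U+00FF; replace everything else with '_'
    sb.Append(c <= '\u00FF' ? c : '_');
}
string outputString = sb.ToString();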
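
For an exact Windows-1252 membership test instead of a range check, one option is to build a lookup table once and reuse it for every record. The following is a minimal sketch, assuming .NET Core 3.0 or later, where CodePagesEncodingProvider is available. The names Win1252Filter and ReplaceOutside are illustrative and do not come from the answer above.

```csharp
using System;
using System.Text;

public static class Win1252Filter
{
    // Lookup table indexed by UTF-16 code unit:
    // true if that character can be produced by decoding a Windows-1252 byte.
    private static readonly bool[] Allowed = BuildTable();

    private static bool[] BuildTable()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page 1252
        // is only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding win1252 = Encoding.GetEncoding(1252);

        // Decode every possible byte value once to learn the character set.
        var allBytes = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            allBytes[i] = (byte)i;
        }

        var table = new bool[char.MaxValue + 1];
        foreach (char c in win1252.GetString(allBytes))
        {
            table[c] = true;
        }
        return table;
    }

    public static string ReplaceOutside(string input, char replacement = '_')
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Works per UTF-16 code unit, so a surrogate pair
            // yields two replacement characters.
            sb.Append(Allowed[c] ? c : replacement);
        }
        return sb.ToString();
    }
}
```

Calling Win1252Filter.ReplaceOutside(line) for each record costs one array lookup per character, which should scale reasonably to a few million lines, though that has not been measured here.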

CodePudding user response:

The Encoding class can achieve this, most likely very efficiently. When converting to and from the encoding, a replacement character can be specified.

using System;
using System.Text;
                    
public class Program
{
    public static void Main()
    {
        // For .NET core only:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var text = "abc絵de           
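
A minimal sketch of the replacement-fallback idea this answer describes, assuming .NET Core 3.0 or later. The names Win1252Sanitizer and Sanitize are hypothetical, and this is not necessarily the code the answer originally contained. It encodes the text to Windows-1252 with "_" as the replacement for unmappable characters, then decodes the bytes back into a string.

```csharp
using System;
using System.Text;

public static class Win1252Sanitizer
{
    private static readonly Encoding Win1252 = CreateEncoding();

    private static Encoding CreateEncoding()
    {
        // Assumption: .NET Core / .NET 5+, where code page 1252 must be
        // registered before it can be requested.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Replace anything that cannot be encoded (or decoded) with "_".
        return Encoding.GetEncoding(
            1252,
            new EncoderReplacementFallback("_"),
            new DecoderReplacementFallback("_"));
    }

    public static string Sanitize(string input)
    {
        // Round trip: characters outside Windows-1252 become '_' on encode.
        byte[] bytes = Win1252.GetBytes(input);
        return Win1252.GetString(bytes);
    }
}
```

With this sketch, Win1252Sanitizer.Sanitize("abc絵de") should return "abc_de", while characters that Windows-1252 does contain, such as €, pass through unchanged.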