How do I Escape Foreign Characters using Regex or StringBuilder?

Time:04-04

I have the following method to clean up strings:

public static string UseStringBuilderWithHashSet(string strIn)
{
    var hashSet = new HashSet<char>("?&^$#@!() -,:;<>’\'-_*");
    // specify capacity of StringBuilder to avoid resizing
    StringBuilder sb = new StringBuilder(strIn.Length);
    foreach (char x in strIn.Where(c => !hashSet.Contains(c)))
    {
        sb.Append(x);
    }
    return sb.ToString();
}

However, strings such as [MV] REOL ちるちる ChiruChiru or [MV] REOL ヒビカセ Hibikase do not get cleaned up.

How can I modify my method so that it turns one of the above strings into, for example, [MV] REOL ChiruChiru?

CodePudding user response:

You're trying to solve this exhaustively by filtering out everything you don't want. This is not optimal, as there are more than 100,000 possible Unicode characters.

You may find better results if you only accept what you do want.

public static string CleanInput(string input)
{
    //a-zA-Z allows any English alphabet character upper or lower case
    //\[ and \] allows []
    //\s allows whitespace
    var regex = new Regex(@"[a-zA-Z\[\]\s]");
    var stringBuilder = new StringBuilder(input.Length);
    foreach(char c in input){
        if(regex.IsMatch(c.ToString())){
            stringBuilder.Append(c);
        }
    }
    string output = stringBuilder.ToString();
    //\s+ will match any run of one or more whitespace characters
    //and replace it with a single space.
    return Regex.Replace(output, @"\s+", " ");
}
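
As a possible variation on the method above, the same whitelist can be applied in one pass with a negated character class, which avoids running the regex once per character. This is only a sketch: the class name InputCleaner, the field names, and the final Trim() call are my own assumptions and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputCleaner
{
    // Matches anything that is NOT an ASCII letter, a square bracket, or whitespace.
    // (Hypothetical field name; same whitelist as the answer, but negated.)
    private static readonly Regex Disallowed = new Regex(@"[^a-zA-Z\[\]\s]");

    // Matches a run of one or more whitespace characters.
    private static readonly Regex WhitespaceRun = new Regex(@"\s+");

    public static string CleanInput(string input)
    {
        // Drop every character outside the whitelist in a single pass.
        string kept = Disallowed.Replace(input, "");

        // Collapse the gaps left behind into single spaces and
        // trim the ends (Trim is an addition, not in the original answer).
        return WhitespaceRun.Replace(kept, " ").Trim();
    }
}
```

With this sketch, InputCleaner.CleanInput("[MV] REOL ちるちる ChiruChiru") returns [MV] REOL ChiruChiru. Note that the character set is still ASCII-only, so accented Latin letters such as é would be removed as well; widen the class if you need them.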