How can I remove all non-alphabetic characters from a Java string using a regex?

I want to remove all non-alphabetic characters from a String using Java.

This is my input: "-Hello, 1 world$!"

The output should be: "Helloworld"

but instead it's outputting: "Hello1world"

I need some help fixing this as I'm fairly new to programming...

import java.util.Scanner;

public class LabProgram {
    public static String removeNonAlpha (String userString) {
        String[] stringArray = userString.split("\\W+");
        String result = new String();
        
        for (int i = 0; i < stringArray.length; i++) {
            result = result + stringArray[i];
        }
        
        return result;
    }
    
    public static void main(String args[]) {
        Scanner scnr = new Scanner(System.in);
        String str = scnr.nextLine();
        String result = removeNonAlpha(str);
        System.out.println(result);
    }
}

CodePudding user response：

This should work:

import java.util.Scanner;

public class LabProgram {
    public static String removeNonAlpha (String userString) {
        return userString.replaceAll("[^A-Za-z]+", "");
    }
    
    public static void main(String args[]) {
        Scanner scnr = new Scanner(System.in);
        String str = scnr.nextLine();
        String result = removeNonAlpha(str);
        System.out.println(result);
    }
}

CodePudding user response：

Take a look at replaceAll(), which expects a regular expression as its first argument and a replacement string as its second:

return userString.replaceAll("[^\\p{Alpha}]", "");

For more information on regular expressions, take a look at this tutorial.

CodePudding user response：

You could use:

public static String removeNonAlpha(String userString) {
    return userString.replaceAll("[^a-zA-Z]+", "");
}

CodePudding user response：

The issue is that \W only matches non-word characters, and digits and the underscore count as word characters alongside letters. The split therefore never treats them as delimiters, so they end up in the result. Splitting on non-letters instead fixes the issue:

String[] stringArray = userString.split("\\P{Alpha}+");

Per the Pattern Javadocs, \W matches any non-word character, where a word character is defined by \w as [a-zA-Z_0-9]. In other words, \w matches the upper and lowercase ASCII letters A - Z and a - z, the digits 0 - 9, and the underscore character ("_"), and \W matches everything else.
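To make the difference concrete, here is a minimal sketch that splits the question's input on \W+ and on \P{Alpha}+ and joins the pieces back together. The class name WordCharDemo and the helper joinSplit are made up for illustration and are not part of the original code.

```java
import java.util.Arrays;

class WordCharDemo {
    // Hypothetical helper: split on the given regex and glue the pieces back together.
    static String joinSplit(String input, String regex) {
        return String.join("", input.split(regex));
    }

    public static void main(String[] args) {
        String input = "-Hello, 1 world$!";

        // \W+ treats the digit as a word character, so "1" survives as its own token.
        System.out.println(Arrays.toString(input.split("\\W+")));

        // Joining those tokens reproduces the unwanted result from the question.
        System.out.println(joinSplit(input, "\\W+"));

        // \P{Alpha}+ treats every non-letter, digits included, as a delimiter.
        System.out.println(joinSplit(input, "\\P{Alpha}+"));
    }
}
```

Run as written, this should print [, Hello, 1, world], then Hello1world, then Helloworld. The empty first token comes from the leading "-", which split() reports as an empty leading substring.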

The solution is to use a regex pattern that matches exactly the characters you want removed. Per the Pattern documentation, you could use [^a-zA-Z] or \P{Alpha}, both of which match everything except the 26 ASCII letters in upper and lower case. If you also want to keep letters beyond those 26 ASCII letters (e.g. accented letters or letters in non-Latin alphabets), you could use \P{IsAlphabetic}.

\p{prop} matches if the input has the property prop, while \P{prop} matches only if the input does not have that property.
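As a rough illustration of how \P{Alpha} and \P{IsAlphabetic} differ, the sketch below runs both against an input containing accented letters. The class name AlphaDemo, its two helper methods, and the sample input are assumptions for demonstration only. The accented characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
class AlphaDemo {
    // Hypothetical helper: removes everything except the 26 ASCII letters (either case).
    static String keepAsciiLetters(String s) {
        return s.replaceAll("\\P{Alpha}", "");
    }

    // Hypothetical helper: removes everything that lacks the Unicode Alphabetic property.
    static String keepAllLetters(String s) {
        return s.replaceAll("\\P{IsAlphabetic}", "");
    }

    public static void main(String[] args) {
        // "-Héllo, 1 wörld$!" with the accented letters escaped.
        String input = "-H\u00e9llo, 1 w\u00f6rld$!";

        // The accented letters are not ASCII, so \P{Alpha} removes them too.
        System.out.println(keepAsciiLetters(input));

        // \P{IsAlphabetic} keeps them because they are alphabetic in Unicode.
        System.out.println(keepAllLetters(input));
    }
}
```

With this input, the first call should yield Hllowrld and the second Héllowörld (how the accented letters display depends on the console encoding). Which behavior is right depends on whether "alphabetic" in your assignment means ASCII-only.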

As other answers have pointed out, there are other issues with your code that make it non-idiomatic, but those aren't affecting the correctness of your solution.

CodePudding user response：

\W is equivalent to [^a-zA-Z_0-9], so it does not match digits (or the underscore), and they are kept in the result.

Just replace it with "[^a-zA-Z]+", as in the example below:

import java.util.Arrays;

class Scratch {
    public static void main(String[] args) {
        String input = "-Hello, 1    world$!";
        System.out.println("Input : " + input);
        // Split on runs of non-letters, then concatenate the remaining pieces.
        String[] split = input.split("[^a-zA-Z]+");
        StringBuilder builder = new StringBuilder();
        Arrays.stream(split).forEach(builder::append);
        System.out.println("Output :" + builder);
    }
}

Output:

Input : -Hello, 1    world$!
Output :Helloworld

You can have a look at this article for more details about regular expressions: https://www.vogella.com/tutorials/JavaRegularExpressions/article.html#meta-characters