Want to parse a string but keep only specific parts of it using regex

The real string is much longer, but I have shortened it for this example:

String s1 = "henry20|*|bob10|*|mark20|*|justin30|*|kyle15|*|95|*|henry30|*|bob50|*|mark70|*|justin30|*|kyle25|*|1000|";

This pattern keeps repeating, and I would like a regex that matches only the tokens where henry, mark or kyle appear (with their number included) and the tokens that consist only of digits.

Essentially, my output should look like this:

(henry20, mark20, kyle15, 95, henry30, mark70, kyle25, 1000)

CodePudding user response：

You could use this regex:

(^(((henry|mark|kyle)\d+(?=\|))|\d+(?=\|))|((?<=\|)(henry|mark|kyle)\d+(?=\|))|(?<=\|)\d+(?=\|)|((?<=\|)(henry|mark|kyle)\d+|(?<=\|)\d+)$)

This regex can be broken down into three parts:

One that matches a token at the beginning (start of the string before the token, a pipe after it):

^(((henry|mark|kyle)\d+(?=\|))|\d+(?=\|))

One that matches a token in the middle (a pipe character both before and after the token):

(((?<=\|)(henry|mark|kyle)\d+(?=\|))|(?<=\|)\d+(?=\|))

One that matches a token at the end of the string (a pipe character before the token, the end of the string after it):

((?<=\|)(henry|mark|kyle)\d+|(?<=\|)\d+)$

I know that the last token in your example is followed by a pipe, but since the first token is not preceded by one, I assumed the string might also end without a pipe, so I have covered that case too.

Each one of these three bits can be further broken down into:

A positive lookbehind that makes sure that every token is preceded by a pipe (second and third case).
A capturing group to either match an occurrence made of henry, mark or kyle followed by one or more digits or an occurrence made of only one or more digits.
A positive lookahead that makes sure that every token is followed by a pipe character (first and second case).

Here is a link to test the regex:

https://regex101.com/r/WWEQZM/4
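
For reference, here is a minimal sketch of how this regex might be applied from Java. The class name TokenRegexDemo and the helper extract are hypothetical names chosen for illustration, and the pattern is written with `\d+` (one or more digits) and with backslashes doubled for a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenRegexDemo {
    // The three-part pattern: token at the start, in the middle, or at the end.
    private static final Pattern TOKEN = Pattern.compile(
        "(^(((henry|mark|kyle)\\d+(?=\\|))|\\d+(?=\\|))"
        + "|((?<=\\|)(henry|mark|kyle)\\d+(?=\\|))|(?<=\\|)\\d+(?=\\|)"
        + "|((?<=\\|)(henry|mark|kyle)\\d+|(?<=\\|)\\d+)$)");

    // Collects every match of the pattern, in order of appearance.
    static List<String> extract(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s1 = "henry20|*|bob10|*|mark20|*|justin30|*|kyle15|*|95|*|"
            + "henry30|*|bob50|*|mark70|*|justin30|*|kyle25|*|1000|";
        // Prints [henry20, mark20, kyle15, 95, henry30, mark70, kyle25, 1000]
        System.out.println(extract(s1));
    }
}
```

The lookbehind is what keeps the digits inside bob10 or justin30 from being picked up as number-only tokens: the character before them is a letter, not a pipe.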

CodePudding user response：

I would suggest using the split() method of String; I find it more readable than a long regex.

The regex to split on is (\|\*)*\|, which means: a pipe followed by an asterisk, zero or more times, followed by a pipe. This covers both the |*| separators and the single trailing pipe.
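
To see what the split alone produces before any filtering, here is a small sketch on a shortened input. SplitDemo and tokens are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

class SplitDemo {
    // Splits on "|*|" separators and on a lone "|"; String.split drops
    // trailing empty strings, so the final pipe adds no empty element.
    static String[] tokens(String input) {
        return input.split("(\\|\\*)*\\|");
    }

    public static void main(String[] args) {
        // Prints [henry20, bob10, 95]
        System.out.println(Arrays.toString(tokens("henry20|*|bob10|*|95|")));
    }
}
```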

Then you can check whether each element is a number or contains henry, mark or kyle.

import java.util.StringJoiner;

public class Temp {

    public static void main(String[] args) {
        String s1 = "henry20|*|bob10|*|mark20|*|justin30|*|kyle15|*|95|*|henry30|*|bob50|*|mark70|*|justin30|*|kyle25|*|1000|";
        // Split on the "|*|" separators and on the trailing "|".
        String[] data = s1.split("(\\|\\*)*\\|");
        StringJoiner joiner = new StringJoiner(", ", "(", ")");
        for (String string : data) {
            // Keep tokens that are all digits or that contain one of the wanted names.
            if (string.matches("\\d+") || string.contains("henry") || string.contains("mark") || string.contains("kyle")) {
                joiner.add(string);
            }
        }
        System.out.println(joiner); // (henry20, mark20, kyle15, 95, henry30, mark70, kyle25, 1000)
    }
}

Another option is to extract the alphanumeric runs directly, since those are the only parts you need. \w matches letters, digits and the underscore, so \w+ matches each token and skips the |*| separators.

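// Requires java.util.regex.Pattern and java.util.regex.Matcher.
Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(s1);
// Create the StringJoiner as in the previous example.
while (matcher.find()) {
    String string = matcher.group();
    // Same filter as before: all digits, or contains one of the wanted names.
    if (string.matches("\\d+") || string.contains("henry") || string.contains("mark") || string.contains("kyle")) {
        joiner.add(string);
    }
}
// Print the joiner as in the previous example.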
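
One thing to keep in mind: contains("henry") also accepts tokens such as henryX or xhenry5. If a token must be exactly one of the names followed by digits, or digits only, as the question describes, a stricter filter could look like the sketch below. StrictFilterDemo, WANTED and filter are hypothetical names, and the exact token shape is an assumption based on the sample string.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StrictFilterDemo {
    // Assumed token shape: an optional name (henry, mark or kyle), then digits.
    private static final Pattern WANTED = Pattern.compile("(?:henry|mark|kyle)?\\d+");

    static List<String> filter(String input) {
        return Arrays.stream(input.split("(\\|\\*)*\\|"))
            // matches() requires the whole token to fit the pattern.
            .filter(token -> WANTED.matcher(token).matches())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [henry20, mark20, 95, 1000]
        System.out.println(filter("henry20|*|bob10|*|mark20|*|95|*|henryX|*|1000|"));
    }
}
```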