Separate string starting from a parenthesis occurrence (regex)

Time:10-22

How can I split a string like this: "Ca(OH)2" => "Ca" and "(OH)2"?

In Python, it can be achieved like this:

import re

compound = "Ca(OH)2"
segments = re.split(r'(\([A-Za-z0-9]*\)[0-9]*)', compound)
print(segments)

Output: ['Ca', '(OH)2', ''] 

I am following this tutorial: https://medium.com/swlh/balancing-chemical-equations-with-python-837518c9075b (except that I want to do it in Java).

To break down the regex (\([A-Za-z0-9]*\)[0-9]*):

  • The outermost parentheses (next to the single quotes) form the capture group, which is the part we want to keep.
  • The inner parentheses, each preceded by a backslash, match literal parentheses (this is called escaping).
  • [A-Za-z0-9] accepts any letter (of either case) or digit inside the parentheses, and the asterisk after the square brackets is a quantifier meaning zero or more of them.
  • [0-9]* near the end includes all digits immediately to the right of the closing parenthesis in the captured segment.

I tried to do it in Java but the output was not what I wanted:

String compound = "Ca(OH)2";
String[] segments = compound.split("(\\([A-Za-z0-9]*\\)[0-9]*)");
System.out.println(Arrays.toString(segments));

Output: [Ca]

CodePudding user response:

In Java, unlike Python's re.split, String#split does not keep the text matched by capture groups. It also discards trailing empty strings, which is why only [Ca] is returned.
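To make the difference concrete, here is a minimal sketch that contrasts the original split with a split on a zero-width lookahead. The class and method names are made up for illustration, and the lookahead pattern is an alternative that is not part of the question.

```java
import java.util.Arrays;

class SplitBehaviorDemo {
    // Splits on the parenthesized group itself: the matched text is consumed
    // as the delimiter, and the trailing empty string is dropped.
    public static String[] splitOnGroup(String compound) {
        return compound.split("(\\([A-Za-z0-9]*\\)[0-9]*)");
    }

    // Splits at the zero-width position just before each "(", so no
    // characters are consumed and the group stays in the result.
    public static String[] splitBeforeParen(String compound) {
        return compound.split("(?=\\()");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnGroup("Ca(OH)2")));     // [Ca]
        System.out.println(Arrays.toString(splitBeforeParen("Ca(OH)2"))); // [Ca, (OH)2]
    }
}
```

The lookahead version only cuts before an opening parenthesis, so anything that follows the group (for example the Cl in "Ca(OH)2Cl") stays attached to it. For that case, collecting matches is the more general approach.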

You can use the following code in Java:

String s = "Ca(OH)2";
Pattern p = Pattern.compile("\\([A-Za-z0-9]+\\)[0-9]*|[A-Za-z0-9]+");
Matcher m = p.matcher(s);
List<String> res = new ArrayList<>();
while(m.find()) {
    res.add(m.group());
}
System.out.println(res); // => [Ca, (OH)2]

Here, the regex \([A-Za-z0-9]+\)[0-9]*|[A-Za-z0-9]+ matches

  • \([A-Za-z0-9]+\)[0-9]* - (, one or more ASCII letters/digits, ) and then zero or more digits
  • | - or
  • [A-Za-z0-9]+ - one or more ASCII letters/digits.

It can also be written as

Pattern p = Pattern.compile("\\(\\p{Alnum}+\\)\\d*|\\p{Alnum}+");
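For a self-contained version, the match-collecting loop can be wrapped in a small helper. This is only a sketch: the class name FormulaSegments and the method name segments are hypothetical, and the second input, "Al2(SO4)3", is an extra example that does not appear in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FormulaSegments {
    // Either a parenthesized group with an optional trailing count,
    // or a plain run of ASCII letters/digits.
    private static final Pattern SEGMENT =
            Pattern.compile("\\(\\p{Alnum}+\\)\\d*|\\p{Alnum}+");

    // Collects every match in order of appearance (hypothetical helper).
    public static List<String> segments(String compound) {
        List<String> result = new ArrayList<>();
        Matcher m = SEGMENT.matcher(compound);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(segments("Ca(OH)2"));   // [Ca, (OH)2]
        System.out.println(segments("Al2(SO4)3")); // [Al2, (SO4)3]
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call, which matters if the helper runs over many formulas.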

CodePudding user response:

Try this, mate:

String[] segments = compound.split("([^\\w*])");

This splits on the parentheses themselves, so they are not kept in the result. The output should be:

[Ca, OH, 2]

Hopefully it will help you!
