I am trying to extract the whole portion of a decimal with units. Conditions:
- Ignore (don't match) numbers whose whole portion has more than 4 digits; anything longer is invalid.
- Decimal digits are not expected to be more than 2.
- Whole numbers without decimals must be supported.
- Numbers can appear at the start, in the middle, or at the end of a string containing alphabetical characters, or on their own without any alphabetical characters.
So far this is what I got in Java. If I were to guess, the problem I am having is that the \d{1,4} pattern is matching with the decimal portion and returning the digits after decimal for the last use case below. All the first three asserts work as expected. Greatly appreciate any help.
    @Test
    public void testTest(){
        Pattern pattern = Pattern.compile("\\b\\d{1,4}(?:(\\.\\d{1,2})?) ml\\b", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher("5432 ml");
        Matcher matcher2 = pattern.matcher("54321 ml");
        Matcher matcher3 = pattern.matcher("1234.0 ml");
        Matcher matcher4 = pattern.matcher("Start 12345.0 ml end");
        String result = matcher.find() ? matcher.group() : "";
        String result2 = matcher2.find() ? matcher2.group() : "";
        String result3 = matcher3.find() ? matcher3.group() : "";
        String result4 = matcher4.find() ? matcher4.group() : "";
        Assert.assertEquals(result, "5432 ml"); // passes, extracts correctly
        Assert.assertEquals(result2, ""); // passes, ignored because the whole number length > 4
        Assert.assertEquals(result3, "1234 ml"); // fails - result3 = 1234.0 ml
        Assert.assertEquals(result4, ""); // fails - result4 = 0 ml
    }
What is really driving me nuts is why the non-capturing group is being captured in the last assert. My understanding is that with '?:' it should never return the decimal portion. Where am I wrong?
CodePudding user response:
It does return "0 ml" because the regex matches the 0 after the decimal point (the position between . and 0 is a valid word boundary), and Matcher::group (oracle.com) returns the full match of the regex. A non-capturing group only means the group gets no number of its own; the text it matches is still part of the overall match. To prevent the stray match, we can use a negative lookbehind (regular-expressions.info) to only match a number that is not preceded by a dot (.). Furthermore, since we need to extract certain parts of the match (the number and the unit), I suggest using named groups (regular-expressions.info). This results in the following regular expression:
\b(?<!\.)(?<number>\d{1,4})(?:(\.\d{1,2})?)(?<unit> ml)\b
Translated to Java, we end up with the following code:
    final Pattern pattern = Pattern.compile(
        "\\b(?<!\\.)(?<number>\\d{1,4})(?:(\\.\\d{1,2})?)(?<unit> ml)\\b",
        Pattern.CASE_INSENSITIVE);
    final Matcher matcher = pattern.matcher("5432 ml");
    final Matcher matcher2 = pattern.matcher("54321 ml");
    final Matcher matcher3 = pattern.matcher("1234.0 ml");
    final Matcher matcher4 = pattern.matcher("Start 12345.0 ml end");
    // Concatenate only the named groups, so the decimal part is dropped.
    final String result = matcher.find()
        ? matcher.group("number") + matcher.group("unit")
        : "";
    final String result2 = matcher2.find()
        ? matcher2.group("number") + matcher2.group("unit")
        : "";
    final String result3 = matcher3.find()
        ? matcher3.group("number") + matcher3.group("unit")
        : "";
    final String result4 = matcher4.find()
        ? matcher4.group("number") + matcher4.group("unit")
        : "";
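For reference, here is a minimal, self-contained sketch of the same idea wrapped in a helper, so the four cases from the question can be checked in one place. The class name VolumeExtractor and the method extract are hypothetical names chosen for illustration, and the inner capturing group around the decimal part is dropped here as an assumption that its value is never needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VolumeExtractor {
    // Same pattern as above, minus the unused inner capturing group (an assumption:
    // the decimal part is only skipped, never read).
    private static final Pattern VOLUME = Pattern.compile(
            "\\b(?<!\\.)(?<number>\\d{1,4})(?:\\.\\d{1,2})?(?<unit> ml)\\b",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the whole part plus the unit,
    // or "" when the input contains no valid value.
    static String extract(String input) {
        Matcher m = VOLUME.matcher(input);
        return m.find() ? m.group("number") + m.group("unit") : "";
    }

    public static void main(String[] args) {
        System.out.println(extract("5432 ml"));              // 5432 ml
        System.out.println(extract("54321 ml"));             // (empty line)
        System.out.println(extract("1234.0 ml"));            // 1234 ml
        System.out.println(extract("Start 12345.0 ml end")); // (empty line)
    }
}
```

The lookbehind is what rejects the last case: without it, the engine is free to start a new match at the 0 that follows the dot.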