A regular expression captures more text than it needs

Time:09-16

I want to capture the whole value 97.47, but the regular expression splits it into 9 and 7.47 and puts them into different fields. This is the regular expression in use:

private static final Pattern COMMISSION_PATTERN =
        Pattern.compile(
                "(total\\[((?:(?<totalFixed>\\d+)(\\s*(\\+)\\s*)?)?" +
                "((?<totalPercent>\\d+(\\.\\d{1,2})?)\\s*%)?" +
                "(\\s*min\\s*(?<totalMin>\\d+))?" +
                "(\\s*max\\s*(?<totalMax>\\d+))?" +
                "(\\s*round\\s*(?<totalRound>\\d+))?)?\\])?(\\s*)" +
                "(partner\\[(?:(\\s*negative:\\s*(?<partnerNegative>(true|false))?\\s*,\\s*)?" +
                "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" +
                "((?<partnerPercent>\\d+(\\.\\d{1,2})?)\\s*%)?" +
                "(\\s*min\\s*(?<partnerMin>\\d+))?" +
                "(\\s*max\\s*(?<partnerMax>\\d+))?" +
                "(\\s*round\\s*(?<partnerRound>\\d+))?" +
                "(\\s*mode\\s*(?<partnerMode>\\w+))?)?\\])?");

The following value arrives in the method: "total[0] partner[97.47%]". It is parsed this way:

String sCommission = "total[0] partner[97.47%]";
for (String comm : sCommission.split("\n")) {
    Matcher matcher = COMMISSION_PATTERN.matcher(comm.trim());
    if (matcher.matches()) {
        String sPartnerFixed = matcher.group("partnerFixed");     // 9
        String sPartnerPercent = matcher.group("partnerPercent"); // 7.47
    }
}

And it should be:

String sPartnerFixed = matcher.group("partnerFixed"); //null
String sPartnerPercent = matcher.group("partnerPercent"); //97.47

I can't figure out where the error is in the regular expression.

CodePudding user response:

The (\s*(\+)\s*)? part in ((?<partnerFixed>\d+)(\s*(\+)\s*)?)? is optional, so the \d+ in the partnerFixed group ends up "adjacent" to the (?<partnerPercent>\d+(?:\.\d{1,2})?) part of the regex, where \d+ is also required and matches one or more digits. The engine can backtrack into the first \d+: with 97.47%, partnerFixed first takes 97, nothing can then match .47% before the ], so the engine gives back the 7 and lets partnerPercent match 7.47. The behavior you see is therefore expected, unless you tell the regex engine that an obligatory pattern must stand between these two number-matching parts.
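The split can be reproduced in isolation. The sketch below is a minimal illustration, assuming a simplified stand-in for the partner[...] part of the pattern; the class name BacktrackDemo, the helper parse, and the shortened group names fixed and percent are hypothetical and not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BacktrackDemo {
    // Simplified, hypothetical stand-in for the partner[...] part of the question's pattern.
    // The separator after the fixed amount is optional, as in the original.
    static final Pattern PARTNER = Pattern.compile(
            "partner\\[(?:(?<fixed>\\d+)(?:\\s*\\+\\s*)?)?"
            + "(?:(?<percent>\\d+(?:\\.\\d{1,2})?)\\s*%)?\\]");

    // Returns {fixed, percent}; an entry is null when its group did not participate.
    // Returns null when the whole input does not match.
    static String[] parse(String input) {
        Matcher m = PARTNER.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("fixed"), m.group("percent") };
    }

    public static void main(String[] args) {
        String[] r = parse("partner[97.47%]");
        // The digits are split between the two groups: fixed=9, percent=7.47
        System.out.println("fixed=" + r[0] + ", percent=" + r[1]);
    }
}
```

An input with an explicit separator, such as partner[5 + 2.5%], is unaffected: the split only happens when the two numeric parts can touch.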

A possible solution would be a word boundary after \d+ in the (?<partnerFixed>\d+) part, i.e. replace "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" with "((?<partnerFixed>\\d+\\b)(\\s*(\\+)\\s*)?)?".
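As a rough check of the word-boundary variant, here is a sketch that again assumes a simplified stand-in for the partner[...] part; WordBoundaryFix, parse, fixed and percent are illustrative names only. The \b stops the engine from ending the fixed amount in the middle of a run of digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryFix {
    // Hypothetical simplified pattern: \\b after \\d+ forbids ending "fixed"
    // between two digits, so 97 can no longer be split into 9 and 7.
    static final Pattern PARTNER = Pattern.compile(
            "partner\\[(?:(?<fixed>\\d+\\b)(?:\\s*\\+\\s*)?)?"
            + "(?:(?<percent>\\d+(?:\\.\\d{1,2})?)\\s*%)?\\]");

    // Returns {fixed, percent}; an entry is null when its group did not participate.
    // Returns null when the whole input does not match.
    static String[] parse(String input) {
        Matcher m = PARTNER.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("fixed"), m.group("percent") };
    }

    public static void main(String[] args) {
        String[] r = parse("partner[97.47%]");
        // fixed=null, percent=97.47
        System.out.println("fixed=" + r[0] + ", percent=" + r[1]);
    }
}
```

With this variant a lone fixed amount such as partner[5] still matches, because the separator stays optional.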

A more sophisticated and more precise way to solve this issue is to make some part of the (\s*(\+)\s*)? pattern obligatory. That is, you do not expect a match for partnerFixed if there is a single streak of digits optionally followed by . and one or two digits. If there is a partnerFixed number, what should separate it from the next value? Judging from the pattern, I think it should be either whitespace or a + enclosed in optional whitespace.

In this latter case, you can replace "((?<partnerFixed>\\d+)(\\s*(\\+)\\s*)?)?" with "((?<partnerFixed>\\d+)(\\s+|\\s*\\+\\s*))?".
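The mandatory-separator variant can be sketched the same way, assuming a simplified stand-in for the partner[...] part; MandatorySeparatorFix, parse, fixed and percent are hypothetical names chosen for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MandatorySeparatorFix {
    // Hypothetical simplified pattern: a fixed amount only counts when it is followed
    // by whitespace or by a '+' with optional whitespace around it.
    static final Pattern PARTNER = Pattern.compile(
            "partner\\[(?:(?<fixed>\\d+)(?:\\s+|\\s*\\+\\s*))?"
            + "(?:(?<percent>\\d+(?:\\.\\d{1,2})?)\\s*%)?\\]");

    // Returns {fixed, percent}; an entry is null when its group did not participate.
    // Returns null when the whole input does not match.
    static String[] parse(String input) {
        Matcher m = PARTNER.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("fixed"), m.group("percent") };
    }

    public static void main(String[] args) {
        String[] r = parse("partner[97.47%]");
        // fixed=null, percent=97.47
        System.out.println("fixed=" + r[0] + ", percent=" + r[1]);
        // A lone fixed amount no longer matches, because the separator is required.
        System.out.println(parse("partner[5]") == null); // true
    }
}
```

One trade-off to keep in mind: in this simplified form, a fixed amount with nothing after it, such as partner[5], is rejected. If such inputs are valid in your data, the word-boundary variant may be the better fit.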

See this regex demo.
