I am using a regex to match a few possible values in a string that comes with my objects, and I need to get every one of those values that matches my string. For example, my string value is "This is the code ABC : xyz use for something".
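Here is the code I am using to extract the matches:
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

String my_string = "This is the code ABC : xyz use for something";
String my_regex = "(ABC|ABC :).*";
List<String> matchers = Pattern.compile(my_regex, Pattern.CASE_INSENSITIVE)
    .matcher(my_string)
    .results()
    .map(MatchResult::group)
    .collect(Collectors.toList()); // Collectors, not Collection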
I am expecting two list items as the output, {"ABC", "ABC :"}, but I am only getting one. Any help would be highly appreciated.
CodePudding user response:
What you describe just isn't how regex engines work. They do not find all possible variant search results; they scan the input from left to right, report one match wherever the pattern matches, and then continue after that match. In other words, had you written:
String my_regex = "(ABC :|ABC)"; // note: get rid of the .*, and put the longer alternative first (Java tries alternatives left to right)
String myString = "This is the code ABC : xyz use for something ABC again";
Then you'd get 2 results back: ABC : and ABC.
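The forward-scanning behaviour can be checked with a small sketch. The class name ForwardMatches and the helper findAll are made up for illustration; .results() needs Java 9 or later. Note that the sketch lists the longer alternative first: Java tries alternatives from left to right and stops at the first one that works, so with ABC listed first both results would be plain ABC.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ForwardMatches {
    // Hypothetical helper: collects every match the engine reports
    // while scanning the input from left to right.
    static List<String> findAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .results()                 // Java 9+
                .map(MatchResult::group)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something ABC again";
        // Longer alternative first, so "ABC :" wins where it is present.
        System.out.println(findAll("(ABC :|ABC)", input)); // [ABC :, ABC]
        // Shorter alternative first: the second alternative is never reached.
        System.out.println(findAll("(ABC|ABC :)", input)); // [ABC, ABC]
    }
}
```

Each occurrence in the input yields exactly one result; the engine never reports both ABC and ABC : for the same position.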
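Yes, the regex could just as easily match just the ABC part instead of the ABC : part and it would still be valid. However, the engine only ever reports one match per position. For an alternation, Java tries the alternatives from left to right and takes the first one that lets the overall match succeed. Quantifiers are by default 'greedy': they match as much as they can. For some operators (specifically, * and +) you can use the non-greedy variants, *? and +?, which match as little as possible.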
In other words, given:
String regex = "(a*?)(a+)";
String myString = "aaaaa";
Then group 1 would match zero a characters (the empty string is the shortest thing (a*?) can match whilst still letting the entire regex match the input), and group 2 would be aaaaa.
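A minimal sketch to verify this split. GroupSplit and its split helper are illustrative names, not part of any library; the helper uses matches(), so the whole input has to be consumed by the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupSplit {
    // Hypothetical helper: returns {group 1, group 2} for a full match
    // of the regex against the input.
    static String[] split(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Lazy first group: it gives up everything to the second group.
        String[] lazy = split("(a*?)(a+)", "aaaaa");
        System.out.println("[" + lazy[0] + "] [" + lazy[1] + "]"); // [] [aaaaa]
    }
}
```

The same helper can be pointed at any other two-group pattern to see how the engine divides the input between the groups.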
If, on the other hand, you wrote (a*)(a+), then group 1 would be aaaa and group 2 would be a. It is not possible to ask the regexp engine to give you the combinatorial explosion, with every possible length of 'a', which appears to be what you want. The regexp API that ships with Java does not have any option to do this, nor does any other regexp API I know of, so you'd have to write that yourself. I admit I haven't scoured the web for every alternative regex engine implementation for Java; there are a bunch of third-party libraries, and perhaps one of them can do it.
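One way the "write that yourself" route might look, as a rough sketch only: test every substring of the input against the whole pattern and keep the ones that match in full. The class AllCandidates and the method allMatchingSubstrings are hypothetical names. This assumes a pattern without the trailing .* (with it, every suffix after ABC would qualify), and it makes a quadratic number of match attempts, so it is only reasonable for short inputs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllCandidates {
    // Hypothetical helper: tries every non-empty substring [start, end)
    // and keeps the distinct ones that the pattern matches completely.
    static List<String> allMatchingSubstrings(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        Set<String> found = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
        for (int start = 0; start < input.length(); start++) {
            for (int end = start + 1; end <= input.length(); end++) {
                m.region(start, end);   // restrict the matcher to this slice
                if (m.matches()) {      // the slice must match in full
                    found.add(input.substring(start, end));
                }
            }
        }
        return new ArrayList<>(found);
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something";
        System.out.println(allMatchingSubstrings("(ABC|ABC :)", input)); // [ABC, ABC :]
    }
}
```

Because matches() must consume the whole slice, the engine backtracks into the second alternative when the slice is ABC :, which is why both variants show up here even though find() would only ever report one of them.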
NB: I said at the start: get rid of the .*. That's because otherwise it's still just the one match: ABC : xyz use for something ABC again is the longest possible match, and given that .* is greedy, that's what you will get. It is a valid 'interpretation' of your string (1 match), consuming the most - that's how it works.
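To see the single match for yourself, here is a small sketch using an explicit find() loop. The names TrailingWildcard and findLoop are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingWildcard {
    // Hypothetical helper: calls find() repeatedly and records each match.
    static List<String> findLoop(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "This is the code ABC : xyz use for something ABC again";
        // The greedy .* swallows the rest of the line, second ABC included.
        List<String> matches = findLoop("(ABC|ABC :).*", input);
        System.out.println(matches.size()); // 1
        System.out.println(matches.get(0)); // ABC : xyz use for something ABC again
    }
}
```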
NB2: Greediness can never change whether a regex matches at all. It just changes which part of the input is assigned to which group and, when find()ing more than once (which .results() does: it find()s until no more matches are found), which matches you get.