I have created a Java regex:
(?=\b((?<![^\p{Alnum}\p{Punct}])(?:\p{Alnum}+\p{Punct}\p{Alnum}+){2})\b)
I tested this regex against a sample string: https://www.google.com.google.com
It is giving me all the expected tokens:
www.google.com google.com.google com.google.com
However, this regex takes a long time when it is run against a large string.
My expected tokens are in the form of "alphanumeric punctuation alphanumeric".
How can I optimize this regex?
CodePudding user response:
You need to simplify the regex like this:
(?=\b((?:\p{Alnum}+\p{Punct}){2}\p{Alnum}+)\b)
See the regex demo.
Details:
\b - a word boundary
((?:\p{Alnum}+\p{Punct}){2}\p{Alnum}+) - Group 1:
(?:\p{Alnum}+\p{Punct}){2} - two occurrences of one or more letters/digits followed by a punctuation char, and then
\p{Alnum}+ - one or more letters/digits
\b - a word boundary
Note that no two adjacent subpatterns can match at the same location in the string (a letter/digit is never a punctuation char), so there is very little backtracking and the pattern is about as efficient as it can be. Still, matching overlapping patterns is never very fast, because the lookahead must be evaluated at each position in the string, from left to right.
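For reference, here is a minimal sketch of how the simplified pattern could be used from Java to collect the overlapping tokens. The class name OverlappingTokens and the method extractTokens are made up for this illustration and are not from the question. The sketch assumes the tokens are read from capture group 1, since the lookahead itself consumes no characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class OverlappingTokens {
    // Compile once and reuse; recompiling per call adds avoidable cost on large inputs.
    private static final Pattern TOKEN = Pattern.compile(
            "(?=\\b((?:\\p{Alnum}+\\p{Punct}){2}\\p{Alnum}+)\\b)");

    // Returns every "alnum punct alnum punct alnum" token, including overlapping ones.
    static List<String> extractTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // Each match is zero-length (lookahead only), so find() moves on
        // one position at a time and the token is taken from group 1.
        while (m.find()) {
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [www.google.com, google.com.google, com.google.com]
        System.out.println(extractTokens("https://www.google.com.google.com"));
    }
}
```

Keeping the compiled Pattern in a static field is a small but cheap win when the method is called many times; the per-position cost of the lookahead described above still applies.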