Home > Enterprise >  Java regex getting surrounding text extremely slow
Java regex getting surrounding text extremely slow

Time:09-14

I am currently using java regex matching to look through a very large number of files, and locate all files that match the pattern. I want to be able to show the matching text in context, so the regex pattern I am using is:

[\\S|\\s]{0,30}abc[\\S|\\s]{0,30}

This successfully grabs "abc" as well as up to 30 of any characters on either side.

However, when running this search through my large database of files, the search takes 21 minutes to complete (while removing the padding and searching for just "abc" makes it run in <1 minute).

Is there a reason that my regex runs so slowly? Is there a more efficient way to grab the context?


NOTE: I am just using java.util.regex.Matcher and java.util.regex.Pattern in a loop over each file. I can create a short mockup of the relevant code if needed, though

CodePudding user response:

Instead of [\\S|\\s], it's a bit smarter to use Pattern.DOTALL to make . match anything. (or just use (?s).{0,30}abc.{0,30} - (?s) turns the DOTALL flag on from within the regexp.

Your basic regular expression can be translated to a Thompson's Construction which means you are guaranteed a performance of O(n m) where n stands for the length of your regexp and m for the length of your input. The input is large, but at least it's a linear growth curve. However, TCs do not support backrefs and there are a few other more exotic reasons why most regexp engines don't actually apply TC conversion. As far as I know, java's regex engine doesn't do it either, which means that regular expressions are technically performance unbounded - you can write a regexp that takes years to resolve. Trivially, there exists a regex that does prime number calculation (matches if the input is a prime number of x, fails if it is anything else). This trivially proves that regex engines can take a very long time to do their work. This regex uses backrefs, thus why it doesn't "disprove" that TC conversion can guarantee speedy evalulation of regexes.

Point is: Regexes can be slow if TCs aren't built, and java's regex engine does not build them.

[\\S|\\s]{0,30} is particularly egregious, adding quite a few complications; you're causing a combinatory explosion of different tries for that regex engine to work through. A TC construction can do it quickly, but as we established, java doesn't do that.

I bet if you just use DOTALL mode, it'll be speedy again.

  • Related