We have a situation where doing a wildcard search with single character at the start and then other characters after the wild card and it runs incredibly slowly (in c# at least). Is there a reason for this and a way to improve things? Its quicker in almost all other cases.
examples for a 20k long random string run 1000 times:
- a.*r1 time taken: 1802
- r1.*a time taken: 9
- r1.*b.*c time taken: 9
- r1f.*b.*c time taken: 16
- a.*r1f.*c time taken: 3199
- a.*r1.*c time taken: 1895
- a.*b.*r1f time taken: 55450
Its definitely not the random string, as have tried different ones.
The pattern is definitely that if the first part is a single character followed by any characters after the wildcard, its always much much slower.
--Update--
I wonder if the way Regex works is that it loops through looking for that single character, and when it finds it, it searchs till then end looking for the next pattern. When it doesn't find it it goes back to that first character and starts looking for the next first character till it finds the first match again and does the some full logic, even though it could skip all those characters it passed on the first run.
I think I have confirmed this by generating a random string without character "a" - if I then use this character as the first character its really quick, but if I use "c" its slow. i.e. a.*b.*r1f is instant in that case but c.*b.*r1f takes a very long time.
If so wonder if you can optimise this in regex somehow?
CodePudding user response:
The reason for this difference in performance lies in the way in which the search is optimized.
When a pattern starts with literal characters, a fast algorithm is used before the "normal walk" of the regex engine to find possible positions in the string where the pattern may succeed (positions where the literal substring is). Then the pattern is tested only at these positions by the regex engine.
This is the reason why a pattern that starts with the letter a
, for a string (whatever the size) that doesn't contain the letter 'a' is quickly solved (no match, the whole pattern is never tested).
Now why for the same kind of pattern, one that starts with the letter a
(only one literal character) and one that starts with abcd
gives most of the time different performances in a random string. The answer is simple, positions with the four characters abcd
are less frequent than positions with only a
. Less positions to try => faster result.
Also note that a pattern like a.*b.*c
is called a pathological pattern since it may cause a potential explosion of the number of backtracking steps. If using non-greedy quantifiers may sometimes reduce the problem, there's no guarantee that it improves always the performances (It isn't a magic wand). The best way is always to be the most precise with appropriate character classes, appropriate quantifiers, and the most accurate description of the string, avoiding when possible .*
or .*?
. a[^b]*b[^c]*c
for instance.