Home > Mobile >  SQLite regex - not PCRE? What are the rules?
SQLite regex - not PCRE? What are the rules?

Time:09-15

I'm pretty used to the PCRE regex flavor at this point, and some other alternatives like plain grep's style. But, I couldn't figure out how to do some basic things in the new built-in SQLite regexes (version 3.37.2)

Attempting to match a whole string with ^ and $ anchors works as expected:

sqlite> SELECT 'pancakes' REGEXP '^pancakes$';
1

I'd expect this next one to work. I always thought | had "everything to left" and "everything to right" behavior (low "precedence"?):

sqlite> SELECT 'pancakes' REGEXP '^pan|cakes$';
0

That works fine in Python:

>>> re.search(r"^pan|cakes$", "pancakes")
<re.Match object; span=(0, 3), match='pan'>

Maybe I have to escape |? But doesn't work:

sqlite> SELECT 'pancakes' REGEXP '^pan\|cakes$';
0

Does | work at all? Yes, it seems to, but maybe not doing what I think:

sqlite> SELECT 'pancakes' REGEXP 'pan|cakes';
1

Maybe the precedence is off, I'll group to be safe. Uh oh:

sqlite> SELECT 'pancakes' REGEXP '(^pan)|(cakes$)';
Error: stepping, unmatched '(' (1)

(escaping the parens also gave the same error)

How can I do grouping, anchoring, | etc? I wasn't able to find much about the regex style in the docs, just something saying that REGEXP operator is really using some regex function under the hood.

CodePudding user response:

I found the code for the default regexp implementation: https://sqlite.org/src/file/ext/misc/regexp.c

The syntax documented in a code comment block suggests that the syntax you used should work, but I find it does not. I tested on 3.37.0 and then upgraded to 3.39.3 but it still fails.

I was able to get it to work this way:

sqlite> SELECT 'pancakes' REGEXP '(^pan)|cakes$';
1

It seems in the SQLite regex parser, | binds tighter than ^ or $, so without the parentheses, it evaluates as if we had written it this way:

sqlite> SELECT 'pancakes' REGEXP '^(pan|cakes)$';
0

This is obviously not going to match, because it would only match the whole string "pan" or "cakes".

This is not compatible with PCRE semantics.

I also tried to use explicit parentheses to override the precedence of |:

sqlite> SELECT 'pancakes' REGEXP '(^pan)|(cakes$)';
Runtime error: unmatched '('

This should work, but it hits a bug in SQLite with respect to ^ and $. Read these threads for the discussion from July 2022:

  • Related