I am working with two databases, one is MS SQL Server, the other is SQLite. Both contain identical data which I can and have verified (at least, they are as identical as the different languages will allow). In working with the two, I came across a confusing difference in the way the two languages perform:
When I run the following query in SQL Server:
SELECT
count(*)
FROM
Pattern as p
WHERE
'RK69M|1M116849' like replace(p.Keys, '*', '_') '%'
I get:
47040
But when I run the equivalent query in SQLite (the only difference is concatenation in SQL Server is
but SQLite uses ||
):
SELECT
count(*)
FROM
Pattern as p
WHERE
'RK69M|1M116849' like replace(p.Keys, '*', '_') || '%'
I get: 43197
Can anyone explain this? Do they use different regex's for matching?
If it matters, in both, the count of records in the the table (removing the where
clauses) is 1304884
I have also tried running the query through multiple channels (TSQL, python, GUI-based query tools, etc), and all get the same result. I've also tested the data with both a python script to compare them and by dumping them to a text file and using the diff command in linux, so I am confident it is the same data in each database.
CodePudding user response:
[In SQL Server] I get: 47040
[In Sqlite] I get: 43197
Can anyone explain this? Do they use different regex's for matching?
It's not regex at all. LIKE
is its own thing. But as is often the case with this kind of question, we can gain insight by going to the documentation.
Here is SQL Server's LIKE operator documentation
The relevant portion from SQL Server describes four different tokens for pattern matching: %
, _
, []
, and [^ ]
Here is Sqlite's documentation
(scroll down to section 5)
The relevant portion only describes the first two pattern tokens: %
and _
.
The documentation for both databases includes additional information on things like escape characters and such that are not used in your query, except the Sqlite documentation includes this:
The LIKE operator is case sensitive by default for unicode characters that are beyond the ASCII range.
I don't know the nature of your data or whether that will matter here, but I can say SQL Server's handling of the same issue depends on the collation, and therefore isn't necessarily the same.
The point is there are documented differences in how the two database engines behave in this area, and therefore given significant data you should expect somewhat different results.