I am facing performance issues when using bind variables with the DBI package. My original use case is a Postgres database, but for reproducibility I use an in-memory SQLite database below, which has exactly the same issue: when I select rows from a table by id (in Postgres the column is indexed), the parametrized version runs several times longer than the equivalent SQL with the IDs pasted into an IN clause:
library(DBI)
library(tictoc)
sample.data <- data.frame(
  id = 1:100000,
  value = rnorm(100000)
)
sqlite <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(
  sqlite, "sample_data",
  sample.data,
  overwrite = TRUE
)
tic("Load by bind")
ids <- 50000:50100
res <- dbSendQuery(sqlite, "SELECT * FROM sample_data WHERE id = $1")
dbBind(res, list(ids))
result <- dbFetch(res)
dbClearResult(res)
toc()
# Load by bind: 0.81 sec elapsed
tic("Load by paste")
ids <- 50000:50100
result2 <- dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", paste(ids, collapse = ","), ")"))
toc()
# Load by paste: 0.04 sec elapsed
It seems I must be making some obvious mistake, as prepared queries are supposed to be faster (and I do see that with Python/SQLAlchemy against the same Postgres example).
Answer:
Your first query (... id = $1) is executed 101 times; your second query (... id in (..)) is executed once. If you audit on the DBMS side (not demonstrated here), you'd see 101 individual queries.
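You can see the per-element execution from the R side as well; a quick sketch (the count assumes DBI's documented behavior of executing the statement once per row of params):

# each element of the bound vector triggers one execution of the statement;
# the affected-row counts accumulate across those executions
dbExecute(sqlite, "UPDATE sample_data SET value = value WHERE id = ?",
          params = list(c(1L, 2L, 3L)))
# [1] 3   (one matched row per execution, i.e. the statement ran three times)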
Up front, a common mistake is to simply modify your statement to use an IN (?) clause,

dbGetQuery(sqlite, "SELECT * FROM sample_data WHERE id in (?)", params = list(ids))

but this also executes the query 101 times, suffering the same performance woes as result1.
To use parameter binding with the much more efficient IN (..) clause, you need to provide that many question-marks (or dollar-numbers).
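Here, idcommas and qmarks are the comma-joined id literals and a matching run of placeholders; they are presumably built along these lines:

ids <- 50000:50100
idcommas <- paste(ids, collapse = ",")                   # "50000,50001,...,50100"
qmarks <- paste(rep("?", length(ids)), collapse = ",")   # "?,?,...,?"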
bench::mark(
  result1 = dbGetQuery(sqlite, "SELECT * FROM sample_data WHERE id = $1", params = list(ids)),
  result2 = dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", idcommas, ")")),
  result3 = dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", qmarks, ")"),
                       params = as.list(ids)),
  min_iterations = 50
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 result1 280.97ms 347.4ms 2.86 20.6KB 0 50 0 17.5s <df[,2] [101 x 2]> <Rprofmem[,3] [14 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 2 result2 7.31ms 8.21ms 115. 15.6KB 0 58 0 502.2ms <df[,2] [101 x 2]> <Rprofmem[,3] [12 x 3]> <bch:tm [58]> <tibble [58 x 3]>
# 3 result3 7.57ms 8.93ms 113. 28.4KB 0 57 0 506ms <df[,2] [101 x 2]> <Rprofmem[,3] [28 x 3]> <bch:tm [57]> <tibble [57 x 3]>
If you're curious, it performs the same (significantly faster) on a postgres instance, though I changed your $1 to ?: sqlite accepts both, odbc/postgres only supports qmarks:
pgcon <- dbConnect(odbc::odbc(), ...) # local docker postgres instance
bench::mark(...)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 result1 967.4ms 1.05s 0.933 20.6KB 0 50 0 53.57s <df[,2] [101 x 2]> <Rprofmem[,3] [14 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 2 result2 57ms 67.7ms 14.4 18.1KB 0 50 0 3.47s <df[,2] [101 x 2]> <Rprofmem[,3] [13 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 3 result3 56.9ms 65.17ms 14.3 21.4KB 0 50 0 3.5s <df[,2] [101 x 2]> <Rprofmem[,3] [15 x 3]> <bch:tm [50]> <tibble [50 x 3]>
I also tested on odbc/sql-server with very similar results.
result2 and result3 are generally quite close here on all three DBMSes, and in fact on different samplings the former is faster than the latter, so I'll call their performance comparison a wash. What, then, is the motivation for using binding? In many cases it's mainly an academic discussion: most of the time, you're doing nothing wrong by not using it (and instead using your paste(ids, collapse=",") method).
However:
1. Inadvertent "SQL injection". Technically SQL injection must be malevolent to be labeled as such, but I informally apply the term to "oops" moments where the data has embedded quotes and pasting it into a static query string breaks the quoting. Fortunately for me, all it ever did was break the parsing of the query; I have not deleted this year's student records.
A common mistake is to try to use sQuote to escape embedded quotes. Long story short: nope, SQL does it differently. Many SQL users do not know that to escape an embedded single quote, one must double it:

sQuote("he's Irish")
# [1] "'he's Irish'"               # WRONG
DBI::dbQuoteString(sqlite, "he's Irish")
# <SQL> 'he''s Irish'              # RIGHT for both sqlite and pgcon
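If you must build a query string yourself, DBI::sqlInterpolate() does the quoting for you; a minimal sketch (the students table here is made up for illustration):

DBI::sqlInterpolate(
  sqlite,
  "SELECT * FROM students WHERE name = ?name",  # 'students' is a hypothetical table
  name = "he's Irish"
)
# <SQL> SELECT * FROM students WHERE name = 'he''s Irish'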
2. Query optimization. Most (all? I don't know for certain) DBMSes do some form of query optimization, attempting to take advantage of an index and/or similar measures. To be good at it, this optimization is done once per query and then cached. However, if you change even one letter of the query, you risk a cache-miss (I am not saying "always" because I have not audited the caching code ... but the premise is clear, I think). This means that changing a query from select * from mytable where a=1 to ... a=2 does not get a cache-hit, and it is optimized (again). Contrast that with select * from mytable where a=? with parameter-binding, and you benefit from the caching. Note that if your list of ids changes length, then the query is likely to be re-optimized (it changes from id in (?,?) to id in (?,?,?)); I don't know if that is truly a cache-miss, again not having audited the DBMSes' code. One workaround is sketched after this list.
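If that length-sensitivity matters for your workload, one workaround (a sketch under the assumption that identical statement text maximizes plan reuse; pad_ids is a hypothetical helper) is to pad the id vector to a fixed batch size, since duplicate values are harmless inside IN (..):

pad_ids <- function(ids, size = 128) {
  stopifnot(length(ids) <= size)
  # repeating an existing id does not change the result of IN (...)
  c(ids, rep(ids[[1]], size - length(ids)))
}

padded <- pad_ids(50000:50100)
qmarks_fixed <- paste(rep("?", length(padded)), collapse = ",")
dbGetQuery(sqlite,
           paste0("SELECT * FROM sample_data WHERE id IN (", qmarks_fixed, ")"),
           params = as.list(padded))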
BTW: your mention of "prepared statement" is very much aligned with this query optimization, but the performance penalty you are experiencing is much more about running the same query 101 times than anything to do with cache hits/misses.
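For completeness, the dbSendQuery/dbBind pattern from your question is exactly how you reuse one prepared statement across several batches, paying the prepare cost once (whether the DBMS reuses the cached plan across dbBind calls is driver-dependent):

res <- dbSendQuery(sqlite, "SELECT * FROM sample_data WHERE id = ?")
for (batch in list(1:3, 50000:50002)) {
  dbBind(res, list(batch))   # re-bind new values; no re-prepare
  print(dbFetch(res))        # each element of 'batch' is still one execution
}
dbClearResult(res)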