I am facing performance issues when using bind variables with the DBI package. My original use case is a Postgres database, but for reproducibility I use an in-memory SQLite database below, which has exactly the same issue: when I select rows from a table by id (in Postgres the column is indexed), the parametrized version runs several times longer than the equivalent SQL with the IDs pasted into an IN clause:
library(DBI)
library(tictoc)
sample.data <- data.frame(
  id = 1:100000,
  value = rnorm(100000)
)
sqlite <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(
  sqlite, "sample_data",
  sample.data,
  overwrite = TRUE
)
tic("Load by bind")
ids <- 50000:50100
res <- dbSendQuery(sqlite, "SELECT * FROM sample_data WHERE id = $1")
dbBind(res, list(ids))
result <- dbFetch(res)
dbClearResult(res)
toc()
# Load by bind: 0.81 sec elapsed
tic("Load by paste")
ids <- 50000:50100
result2 <- dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", paste(ids, collapse = ","), ")"))
toc()
# Load by paste: 0.04 sec elapsed
It seems I must be making some obvious mistake, as prepared queries are supposed to be faster (and I do see that with Python/SQLAlchemy against the same Postgres example).
Answer:
Your first query (... id = $1) is executed 101 times; your second query (... id in (..)) is executed once. If you audit on the DBMS side (not demonstrated here), you'd see 101 individual queries.
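You can see the per-element execution from the R side as well; a quick sketch (the count assumes DBI's documented behavior of executing the statement once per row of params):

# each element of the bound vector triggers one execution of the statement;
# the affected-row counts accumulate across those executions
dbExecute(sqlite, "UPDATE sample_data SET value = value WHERE id = ?",
          params = list(c(1L, 2L, 3L)))
# [1] 3   (one matched row per execution, i.e. the statement ran three times)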
Up front, a common mistake is to simply modify your statement to use an IN (?) clause,

dbGetQuery(sqlite, "SELECT * FROM sample_data WHERE id in (?)", params = list(ids))

but this also executes the query 101 times, suffering the same performance woes as result1.
To use parameter binding with the much more efficient IN (..) clause, you need to provide that many question-marks (or dollar-numbers).
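Here, idcommas and qmarks are the comma-joined id literals and a matching run of placeholders; they are presumably built along these lines:

ids <- 50000:50100
idcommas <- paste(ids, collapse = ",")                   # "50000,50001,...,50100"
qmarks <- paste(rep("?", length(ids)), collapse = ",")   # "?,?,...,?"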
bench::mark(
  result1 = dbGetQuery(sqlite, "SELECT * FROM sample_data WHERE id = $1", params = list(ids)),
  result2 = dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", idcommas, ")")),
  result3 = dbGetQuery(sqlite, paste0("SELECT * FROM sample_data WHERE id IN (", qmarks, ")"),
                       params = as.list(ids)),
  min_iterations = 50
)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 result1 280.97ms 347.4ms 2.86 20.6KB 0 50 0 17.5s <df[,2] [101 x 2]> <Rprofmem[,3] [14 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 2 result2 7.31ms 8.21ms 115. 15.6KB 0 58 0 502.2ms <df[,2] [101 x 2]> <Rprofmem[,3] [12 x 3]> <bch:tm [58]> <tibble [58 x 3]>
# 3 result3 7.57ms 8.93ms 113. 28.4KB 0 57 0 506ms <df[,2] [101 x 2]> <Rprofmem[,3] [28 x 3]> <bch:tm [57]> <tibble [57 x 3]>
If you're curious, it performs the same (significantly faster) on a postgres instance, though I changed your $1 to ?: sqlite accepts both, odbc/postgres only supports qmarks:
pgcon <- dbConnect(odbc::odbc(), ...) # local docker postgres instance
bench::mark(...)
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 result1 967.4ms 1.05s 0.933 20.6KB 0 50 0 53.57s <df[,2] [101 x 2]> <Rprofmem[,3] [14 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 2 result2 57ms 67.7ms 14.4 18.1KB 0 50 0 3.47s <df[,2] [101 x 2]> <Rprofmem[,3] [13 x 3]> <bch:tm [50]> <tibble [50 x 3]>
# 3 result3 56.9ms 65.17ms 14.3 21.4KB 0 50 0 3.5s <df[,2] [101 x 2]> <Rprofmem[,3] [15 x 3]> <bch:tm [50]> <tibble [50 x 3]>
I also tested on odbc/sql-server with very similar results.
result2 and result3 are generally quite close here on all three DBMSes, and in fact on different samplings the former is faster than the latter, so I'll call their performance comparison a wash. What, then, is the motivation for using binding? In many cases it's mainly an academic discussion: most of the time, you're doing nothing wrong by not using it (and instead using your paste(ids, collapse=",") method).
However:
1. Inadvertent "SQL injection". Technically SQL injection must be malevolent to be labeled as such, but I informally apply the term to "oops" moments where the data has embedded quotes and pasting it into a static query string breaks the quoting. Fortunately for me, all it ever did was break the parsing of the query; I have not deleted this year's student records.
A common mistake is to try to use sQuote to escape embedded quotes. Long story short: nope, SQL does it differently. Many SQL users do not know that to escape an embedded single quote, one must double it:

sQuote("he's Irish")
# [1] "'he's Irish'"               # WRONG
DBI::dbQuoteString(sqlite, "he's Irish")
# <SQL> 'he''s Irish'              # RIGHT for both sqlite and pgcon
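If you must build a query string yourself, DBI::sqlInterpolate() does the quoting for you; a minimal sketch (the students table here is made up for illustration):

DBI::sqlInterpolate(
  sqlite,
  "SELECT * FROM students WHERE name = ?name",  # 'students' is a hypothetical table
  name = "he's Irish"
)
# <SQL> SELECT * FROM students WHERE name = 'he''s Irish'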
2. Query optimization. Most (all? I don't know for certain) DBMSes do some form of query optimization, attempting to take advantage of an index and/or similar measures. To be good at it, this optimization is done once per query and then cached. However, if you change even one letter of the query, you risk a cache-miss (I am not saying "always" because I have not audited the caching code ... but the premise is clear, I think). This means that changing a query from select * from mytable where a=1 to ... a=2 does not get a cache-hit, and it is optimized (again). Contrast that with select * from mytable where a=? with parameter-binding, and you benefit from the caching. Note that if your list of ids changes length, then the query is likely to be re-optimized (it changes from id in (?,?) to id in (?,?,?)); I don't know if that is truly a cache-miss, again not having audited the DBMSes' code. One workaround is sketched after this list.
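If that length-sensitivity matters for your workload, one workaround (a sketch under the assumption that identical statement text maximizes plan reuse; pad_ids is a hypothetical helper) is to pad the id vector to a fixed batch size, since duplicate values are harmless inside IN (..):

pad_ids <- function(ids, size = 128) {
  stopifnot(length(ids) <= size)
  # repeating an existing id does not change the result of IN (...)
  c(ids, rep(ids[[1]], size - length(ids)))
}

padded <- pad_ids(50000:50100)
qmarks_fixed <- paste(rep("?", length(padded)), collapse = ",")
dbGetQuery(sqlite,
           paste0("SELECT * FROM sample_data WHERE id IN (", qmarks_fixed, ")"),
           params = as.list(padded))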
BTW: your mention of "prepared statement" is very much aligned with this query optimization, but the performance penalty you are experiencing is much more about running the same query 101 times than anything to do with cache hits/misses.
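For completeness, the dbSendQuery/dbBind pattern from your question is exactly how you reuse one prepared statement across several batches, paying the prepare cost once (whether the DBMS reuses the cached plan across dbBind calls is driver-dependent):

res <- dbSendQuery(sqlite, "SELECT * FROM sample_data WHERE id = ?")
for (batch in list(1:3, 50000:50002)) {
  dbBind(res, list(batch))   # re-bind new values; no re-prepare
  print(dbFetch(res))        # each element of 'batch' is still one execution
}
dbClearResult(res)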