I'm writing an R query that runs several SQL queries using the DBI package to create reports. To make this work, I need to be able to declare a variable in R (such as a Period End Date) that is then called from within the SQL query. When I run my query, I get the following error:
If I simply use the field name (PeriodEndDate), I get the following error:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbGetQuery’ for signature ‘"Microsoft SQL Server", "character"’
If I use @ to access the field name (@PeriodEndDate), I get the following error:
Error: nanodbc/nanodbc.cpp:1655: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]Must declare the scalar variable "@PeriodEndDate". [Microsoft][ODBC SQL Server Driver][SQL Server]Statement(s) could not be prepared. '
An example query might look like this:
library(DBI) # Used for connecting to SQL server and submitting SQL queries.
library(tidyverse) # Used for data manipulation and creating/saving CSV files.
library(lubridate) # Used to calculate end of month, start of month in queries
# Define time periods for queries.
PeriodEndDate <<- ceiling_date(as.Date('2021-10-31'),'month') # Enter Period End Date on this line.
PeriodStartDate <<- floor_date(PeriodEndDate, 'month')
# Connect to SQL Server.
con <- dbConnect(
odbc::odbc(),
driver = "SQL Server",
server = "SERVERNAME",
trusted_connection = TRUE,
timeout = 5,
encoding = "Latin1")
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= @PeriodEndDate
")
I believe one way might be to use the paste function, like this:
samplequery <- dbGetQuery(con, paste("
SELECT * FROM [TableName]
WHERE OrderDate <=", PeriodEndDate")
However, that can get unwieldy if it involves several variables being referenced outside the query or in several places within the query.
Is there a relatively straightforward way to do this?
Thanks in advance for any thoughts you might have!
CodePudding user response:
The mechanism in most DBI
-based connections is to use ?
-placeholders[1] in the query and params=
in the call to DBI::dbGetQuery
or DBI::dbExecute
.
Perhaps this:
samplequery <- dbGetQuery(con, "
SELECT * FROM [TableName]
WHERE OrderDate <= ?
", params = list(PeriodEndDate))
In general the mechanisms for including an R object as a data-item are enumerated well in https://db.rstudio.com/best-practices/run-queries-safely/. In the order of my recommendation,
- Parameterized queries (as shown above);
glue::glue_sql
;sqlInterpolate
(which uses the same?
-placeholders as #1);- The link also mentions "manual escaping" using
dbQuoteString
.
Anything else is in my mind more risky due to inadvertent SQL corruption/injection.
I've seen many questions here on SO that try to use one of the following techniques: paste
and/or sprintf
using sQuote
or hard-coded paste0("'", PeriodEndDate, "'")
. These are too fragile in my mind and should be avoided.
My preference for parameterized queries extends beyond this usability, it also can have non-insignificant impacts on repeated use of the same query, since DBMSes tend to analyze/optimize the query and cache this for the next use. Consider this:
### parameterized queries
DBI::dbGetQuery("select ... where OrderDate >= ?", params=list("2020-02-02"))
DBI::dbGetQuery("select ... where OrderDate >= ?", params=list("2020-02-03"))
### glue_sql
PeriodEndDate <- as.Date("2020-02-02")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2020-02-02'
DBI::dbGetQuery(con, qry)
PeriodEndDate <- as.Date("2021-12-22")
qry <- glue::glue_sql("select ... where OrderDate >= {PeriodEndDate}", .con=con)
# <SQL> select ... where OrderDate >= '2021-12-22'
DBI::dbGetQuery(con, qry)
In the case of parameterized queries, the "query" itself never changes, so its optimized query (internal to the server) can be reused.
In the case of the glue_sql
queries, the query itself changes (albeit just a handful of character), so most (all?) DBMSes will re-analyze and re-optimize the query. While they tend to do it quickly, and most analysts' queries are not complex, it is still unnecessary overhead, and missing an opportunity in cases where your query and/or the indices require a little more work to optimize well.
Notes:
?
is used by most DBMSes but not all. Others use$name
or$1
or such. Withodbc::odbc()
, however, it is always?
(no name, no number), regardless of the actual DBMS.Not sure if you are using this elsewhere, but the use of
<<-
(vice<-
or=
) can encourage bad habits and/or unreliable/unexpected results.It is not uncommon to use the same variable multiple times in a query. Unfortunately, you will need to include the variable multiple times, and order is important. For example,
samplequery <- dbGetQuery(con, " SELECT * FROM [TableName] WHERE OrderDate <= ? or (SomethingElse = ? and OrderDate > ?)0 ", params = list(PeriodEndDate, 99, PeriodEndDate))
If you have a list/vector of values and want to use SQL's
IN
operator, then you have two options, my preference being the first (for the reasons stated above):Create a string of question marks and paste into the query. (Yes, this is
paste
ing into the query, but we are not dealing with the risk of incorrectly single-quoting or double-quoting. SinceDBI
does not support any other mechanism, this is what we have.)MyDates <- c(..., ...) qmarks <- paste(rep("?", length(MyDates)), collapse=",") samplequery <- dbGetQuery(con, sprintf(" SELECT * FROM [TableName] WHERE OrderDate IN (%s) ", qmarks), params = as.list(MyDates))
glue_sql
supports expanding internally:MyDates <- c(..., ...) qry <- glue::glue_sql(" SELECT * FROM [TableName] WHERE OrderDate IN ({MyDates*})", .con=con) DBI::dbGetQuery(con, qry)