Home > Software design >  Selecting Data Using Conditions Stored in a Variable
Selecting Data Using Conditions Stored in a Variable

Time:06-23

Pretend I have this table on a server:

library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

iris$id = 1:nrow(iris)
dbWriteTable(con, "iris", iris)

I want to select some some random rows from this dataset - suppose I create an R variable that contains the random rows that I want to select:

rows_to_select = sample.int(10, 5, replace = TRUE)
[1] 1 1 8 8 7

I then tried to select these rows from my table - but this "rows_to_select" variable is not being recognized for some reason:

DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (rows_to_select) limit 100;")

Error: no such column: rows_to_select

This code works fine if I manually specify which rows I want (e.g. I want the first row, and the fifth row selected twice):

#works - but does not return the 5th row twice
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (1,5,5) limit 100;")
  • Does anyone know how to fix this?

Thank you!

CodePudding user response:

In general, just including rows_to_select in a query is not going to know to reach out of the SQLite environment and "invade" the R environment (completely different!) and look for a variable. (For that matter, why doesn't select a.* ... find dplyr::select?) This is the case both for pragmatic reasons and security (though mostly pragmatic).

You may want to consider parameterized queries vice constructing query strings manually. In addition to security concerns about malicious SQL injection (e.g., XKCD's Exploits of a Mom aka "Little Bobby Tables"), it is also a concern for malformed strings or Unicode-vs-ANSI mistakes, even if it's one data analyst running the query. DBI supports parameterized queries.

Long story short, try this:

set.seed(42)
rows_to_select = sample.int(10, 5, replace = TRUE)
rows_to_select
# [1]  1  5  1  9 10
qmarks <- paste(rep("?", length(rows_to_select)), collapse = ",")
qmarks
# [1] "?,?,?,?,?"

DBI::dbGetQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.rnum in (", qmarks, ") limit 100;"),
  params = as.list(rows_to_select))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species id rnum
# 1          5.1         3.5          1.4         0.2  setosa  1    1
# 2          5.0         3.6          1.4         0.2  setosa  5    5
# 3          4.4         2.9          1.4         0.2  setosa  9    9
# 4          4.9         3.1          1.5         0.1  setosa 10   10

In this case it is rather trivial, but if you have a more complicated query where you use question marks ("bindings") at different places in the query, the order must align perfectly with the elements of the list assigned to the params= argument of dbGetQuery.


Alternative: insert a temp table with your candidate values, then left-join against it.

dbWriteTable(con, "mytemp", data.frame(rnum = rows_to_select), temporary = TRUE)

DBI::dbGetQuery(con, 
  "select i.* from mytemp m left join iris i on i.id=m.rnum")
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
# 1          5.1         3.5          1.4         0.2  setosa  1
# 2          5.0         3.6          1.4         0.2  setosa  5
# 3          5.1         3.5          1.4         0.2  setosa  1
# 4          4.4         2.9          1.4         0.2  setosa  9
# 5          4.9         3.1          1.5         0.1  setosa 10

DBI::dbExecute(con, "drop table mytemp")
# [1] 0
  •  Tags:  
  • sql r
  • Related