Home > Software engineering >  R, order of filtering when querying from a SQL database
R, order of filtering when querying from a SQL database

Time:06-13

I am trying to extract a subset of a table located in a SQL database, and the result I get is different depending on the order of applied filters.

I need to apply three filters, say "YEAR %in% years", "SYSTEM != 'G'", and "CODE == code".

I've tried putting all three conditions into the SQL statement:

table1 <- dbGetQuery(
  conn = connection, 
  statement = paste0("SELECT * FROM database WHERE YEAR IN (",
                     paste0(years, collapse = ","), ") AND SYSTEM != 'G'" AND CODE = ", code))

But the output I got was an empty table.

Then I tried putting two of the conditions into the statement, and then filtering the output using the third condition:

table2 <- dbGetQuery(
  conn = connection, 
  statement = paste0("SELECT * FROM database WHERE YEAR IN (",
                     paste0(years, collapse = ","), ") AND SYSTEM != 'G'")) %>%
  filter(CODE == code)

table3 <- dbGetQuery(
  conn = connection, 
  statement = paste0("SELECT * FROM database WHERE SYSTEM != 'G' AND CODE = ", code)) %>%
  filter(YEAR %in% years)

As a result, table2 contained 914 rows, while table3 contained 184 rows, even though the results should be exactly the same.

I expected the results from each of the three code chunks to be the same, so I'm a bit lost as to what the issue might be.

CodePudding user response:

The parameterized (and safer) version of your query:

years <- 2022:2024
qry <- paste("SELECT * FROM database WHERE YEAR IN (", paste(rep("?", length(years)), collapse = ","), ") AND SYSTEM != 'G' AND CODE = ?")
qry
# [1] "SELECT * FROM database WHERE YEAR IN ( ?,?,? ) AND SYSTEM != 'G' AND CODE = ?"

table <- dbGetQuery(conn = connection, statement = qry, params = c(years, list(code)))

One might go a step further and parameterize the constant 'G' in your query. If it's always 'G', then you should be okay, but if that is determined externally then it should also be parameterized.

Note that the alignment of ? in the query must align perfectly with the data in the params= list. If one needs to use the same variable multiple times in the SQL query, then it must appear multiple times in the params list, in the correct order with all other parameters to be bound. The params= list cannot include unused values. While all of this might sound over-constraining and/or complicated, it is really meant for the protection of the query and is worth it. Too many times I've had a (non-parameterized) query mysteriously break with no clear clue, due to ANSI or unquoted spaces or something similar. (Fortunately, it's always only broken the query and never corrupted a table.)

CodePudding user response:

I agree with charlieface that you should ideally use parameterised queries, but try this for the time being:

# If CODE is varchar: code => character scalar
code <- '0001'

# Try this: table1 => data.frame
table1 <- dbGetQuery(
  conn = connection, 
  statement = paste0(
    "SELECT * FROM database WHERE YEAR IN (",
    paste0(
      years, 
      collapse = ","
    ), 
    ") AND SYSTEM != 'G' AND CODE = '", 
    code,
    "'"
  )
)

# If CODE is an integer: code => integer scalar
code <- 1

# Try this:  table1 => data.frame
table1 <- dbGetQuery(
  conn = connection, 
  statement = paste0(
    "SELECT * FROM database WHERE YEAR IN (",
    paste0(
      years, 
      collapse = ","
    ), 
    ") AND SYSTEM != 'G' AND CODE = '", 
    code,
    "'"
  )
)

Dummy input data:

# Generate some dummy data: years => integer vector
years <- seq(from = 2000, to = 2022, by = 1)                     
  • Related