Home > OS >  Filter data in one SQLite database table by ids from another table in the same database
Filter data in one SQLite database table by ids from another table in the same database

Time:04-02

I have a local SQLite database with two tables. I want to make a query from one table filtering rows by ids from another table in the same database using dbplyr.

As an example, here is what I would do without a database involved:

# install.packages("tidyverse")
library(tidyverse)

data <-
  tibble(id = seq(1, 8),
         data = LETTERS[seq(1, 8)])

condition <- 
  tibble(id = seq(1, 8),
         group = c(rep("low", 4), rep("high", 4)))

data %>% 
  filter(id %in% pull(condition %>% 
                        filter(group == "high") %>% 
                        select(id)))

which gives me

# A tibble: 4 × 2
     id data 
  <int> <chr>
1     5 E    
2     6 F    
3     7 G    
4     8 H

Using the same code with tables in the database instead of tibbles in the memory

# install.packages("RSQLite")
library(RSQLite)

test_db <- 
  dbConnect(RSQLite::SQLite(), "test_db.sqlite")

dbWriteTable(test_db, "data_tbl", data)
dbWriteTable(test_db, "condition_tbl", condition)

dbListTables(test_db)

data_db <- 
  tbl(test_db, "data_tbl")

condition_db <- 
  tbl(test_db, "condition_tbl")      

data_db %>% 
  filter(id %in% 
           pull(condition_db %>% 
                  filter(group == "high") %>% 
                  select(id)))

produces the following error:

Error in UseMethod("escape") : 
 not applicable method for 'escape' applied on object of class "c('tbl_SQLiteConnection', 'tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

When I load the ids in the memory first the query works without a problem.

detour <- 
  pull(condition_db %>% 
         filter(group == "high") %>% 
         select(id) %>% 
         collect())

data_db %>% 
  filter(id %in% detour)

# A tibble: 4 × 2
     id data 
  <int> <chr>
1     5 E    
2     6 F    
3     7 G    
4     8 H

Since pull() seems to be the problem here - how can I make such queries without the "detour" provided above?

CodePudding user response:

I use a semi-join for this problem when working with dbplyr. A semi-join between two tables returns every record from the first table where there is at least one match with a record in the second table. (An anti-join is similar, returning where there are no matches in the second table.)

This would look like:

prepared_condition = condition %>%
  filter(group == "high")

output = data %>%
  semi_join(prepared_condition, by = "id")
  • Related