Home > Back-end >  Using string matching like grepl in a dbplyr pipeline
Using string matching like grepl in a dbplyr pipeline

Time:04-23

dbplyr is very handy as it convert dplyr code into SQL. This works really well except when it doesn't. For example i am trying to subset rows by partially matching a string against values in a column. With exception of postgres, it appears as though this isn't yet implemented in dbplyr. Am I missing some {stringr} function that would accomplish the below:

library(dplyr, warn.conflicts = FALSE)
library(DBI)
data("flights", package = "nycflights13")

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", flights)

## works in dplyr
flights %>%
  filter(grepl("N", tailnum))
#> # A tibble: 334,264 × 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#>  1  2013     1     1      517            515         2      830            819
#>  2  2013     1     1      533            529         4      850            830
#>  3  2013     1     1      542            540         2      923            850
#>  4  2013     1     1      544            545        -1     1004           1022
#>  5  2013     1     1      554            600        -6      812            837
#>  6  2013     1     1      554            558        -4      740            728
#>  7  2013     1     1      555            600        -5      913            854
#>  8  2013     1     1      557            600        -3      709            723
#>  9  2013     1     1      557            600        -3      838            846
#> 10  2013     1     1      558            600        -2      753            745
#> # … with 334,254 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

## no function translation to grepl
tbl(con, "flights") %>%
  filter(grepl("N", tailnum)) %>%
  collect()
#> Error: no such function: grepl

## Also not implemented for stringr
library(stringr)
tbl(con, "flights") %>%
  filter(str_detect(tailnum, "N")) %>%
  collect()
#> Error: str_detect() is not available in this SQL variant

dbDisconnect(con)

CodePudding user response:

We may use %like%

tbl(con, "flights") %>%
  dplyr::filter(tailnum %like% "%N%") %>%
  collect()

-output

# A tibble: 334,264 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11 UA        1545 N14228  EWR    IAH        227
 2  2013     1     1      533            529         4      850            830        20 UA        1714 N24211  LGA    IAH        227
 3  2013     1     1      542            540         2      923            850        33 AA        1141 N619AA  JFK    MIA        160
 4  2013     1     1      544            545        -1     1004           1022       -18 B6         725 N804JB  JFK    BQN        183
 5  2013     1     1      554            600        -6      812            837       -25 DL         461 N668DN  LGA    ATL        116
 6  2013     1     1      554            558        -4      740            728        12 UA        1696 N39463  EWR    ORD        150
 7  2013     1     1      555            600        -5      913            854        19 B6         507 N516JB  EWR    FLL        158
 8  2013     1     1      557            600        -3      709            723       -14 EV        5708 N829AS  LGA    IAD         53
 9  2013     1     1      557            600        -3      838            846        -8 B6          79 N593JB  JFK    MCO        140
10  2013     1     1      558            600        -2      753            745         8 AA         301 N3ALAA  LGA    ORD        138
# … with 334,254 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
  • Related