Re-Writing "Fuzzy Join" Functions from R to SQL


In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL connection:

library(fuzzyjoin)
library(dplyr)

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

sample_query = sqlQuery(con, stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
  filter(date_1 >= date_2, date_1 <= date_3) )

View(sample_query)

However, I do not think this is possible, because the function being used for the "fuzzy join" (stringdist_inner_join) is not supported by Netezza.

I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R

My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized by Netezza? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?

Right now I can only execute "sample_query" locally - re-writing this function (stringdist_inner_join) in SQL would let the database perform "sample_query" much faster.

Does anyone know if this is possible?

Note:

My data looks like this:

table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)
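Since string-distance functions operate on character data and the date comparisons assume real dates, it may be safer to store these columns as character/Date rather than factor. A minimal base-R sketch for table_1 (the same idea applies to table_2; this is an editing suggestion, not part of the original question):

```r
# Store id columns as character and dates as Date instead of factor,
# since string-distance functions expect character input.
table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"),
                     id2 = c("11", "12", "14", "13"),
                     date_1 = c("2010-01-31", "2010-01-31", "2015-01-31", "2018-01-31"),
                     stringsAsFactors = FALSE)
table_1$date_1 = as.Date(table_1$date_1)
```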

CodePudding user response:

Based on your other post about this issue, the question of how to structure the SQL query was solved:

SAS: Fuzzy Joins

select a.*, b.* 
from table_a a
inner join table_b b
  on (a.date_1 between b.date_2 and b.date_3)
      and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)
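For intuition about what that condition computes: le_dst is a Levenshtein edit distance, which base R's adist() also computes, so you can preview locally which id1 pairs would survive a given distance threshold (using adist() as a local stand-in for le_dst is my assumption, not part of the original answer):

```r
# Preview the fuzzy-match condition locally: adist() computes the
# Levenshtein edit distance between every pair of strings.
a_id1 <- c("123 A", "123BB", "12 5", "12--5")
b_id1 <- c("0123", "1233", "125  .", "125_")

d <- adist(a_id1, b_id1)       # 4 x 4 matrix of edit distances
which(d <= 2, arr.ind = TRUE)  # pairs within a max_dist = 2 threshold
```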

To get this to run in an R script, I would recommend using dbplyr: create the query with tbl() so you can keep doing basic manipulation of it as if it were a data.frame, let dbplyr translate those (at least basic) commands into SQL, then combine everything into a query and eventually pull the data from it with the collect() function.

Edit: Just a note, the tbl command will start building a SQL statement and fetch the column names, but it won't run anything to pull data until you call collect(), at which point R sends the query to the server, the server runs it, and the data is returned.

Just keep this in mind, because if dbplyr can't translate something to SQL, it will assume it's a SQL command and try to send it, so you won't know there's an error until you try to collect. For example, the stringr function str_detect isn't implemented in dbplyr, so dbplyr would send that command to the database, which would throw an error because the database doesn't know what it is, but only after running collect(). Check out the dbplyr page linked above for details.
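You can see this pass-through behaviour without any database by asking dbplyr for its translation directly (simulate_dbi() is a generic stand-in connection; the exact SQL printed depends on your dbplyr version):

```r
library(dbplyr)

con <- simulate_dbi()

# A function dbplyr knows gets translated to SQL...
translate_sql(mean(x, na.rm = TRUE), con = con)

# ...but an unknown one such as le_dst() is emitted verbatim,
# on the assumption that the server will understand it.
translate_sql(le_dst(id1, id2), con = con)
```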


library(DBI)
library(odbc)
library(dbplyr)
library(dplyr)

new_con <- dbConnect(
  odbc(),
  Driver = "ODBC Driver 17 for SQL Server (as an example)",
  Server = "Server name here",
  uid = "some_id",
  pwd = "abc"
)

sample_query <- tbl(
  new_con,
  sql(
    "select a.*, b.* 
     from table_a a
     inner join table_b b
       on (a.date_1 between b.date_2 and b.date_3)"
  )
)

sample_data <- sample_query %>%
  filter(silly_example == TRUE) %>%
  collect()

CodePudding user response:

I agree with @Roger-123's approach. But here is a variation that might assist:

Assuming you are using remote connections to access the Netezza database, you could do this using dbplyr as follows:

remote_1 = tbl(con, "table_1_name")
remote_2 = tbl(con, "table_2_name")

# create dummy column
remote_1 = mutate(remote_1, ones = 1)
remote_2 = mutate(remote_2, ones = 1)

output = remote_1 %>%
  # cross join via the dummy column
  inner_join(remote_2, by = "ones", suffix = c("_1", "_2")) %>%
  # calculate Levenshtein distance between the two id1 columns
  mutate(distance = le_dst(id1_1, id1_2)) %>%
  # filter to close matches
  filter(distance <= 2)

Notes:

  • dbplyr does not allow for complex conditions in its joins. Hence we do the most general join possible and then filter.
  • If you also want joins by date, then you can put them into the inner_join if the conditions are simple, or create another filter condition if they are complex.
  • le_dst is not an R function and there is no dbplyr translation for it, so dbplyr will pass it to the server as-is.
  • Netezza accepts two distance functions for text: le_dst and dle_dst. You can use whichever you please here.
  • Output is a query: it will act like a table, but it is being generated/calculated on the fly. It has not been written to disk or loaded into R memory. Depending on your application, you will want to store/save the result.
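Putting these notes together, you can preview the SQL this pipeline would generate without touching Netezza by using dbplyr's lazy_frame() stand-in tables (the column names mirror the question's tables; le_dst is assumed to exist server-side, and the date condition is expressed as an extra filter):

```r
library(dbplyr)
library(dplyr)

# Stand-in remote tables with the question's columns (no database needed).
remote_1 <- lazy_frame(id1 = character(), id2 = character(),
                       date_1 = character(), .name = "table_1")
remote_2 <- lazy_frame(id1 = character(), id2 = character(),
                       date_2 = character(), date_3 = character(),
                       .name = "table_2")

query <- remote_1 %>%
  mutate(ones = 1) %>%
  inner_join(remote_2 %>% mutate(ones = 1),
             by = "ones", suffix = c("_1", "_2")) %>%
  mutate(distance = le_dst(id1_1, id1_2)) %>%       # passed through as-is
  filter(distance <= 2,
         date_1 >= date_2, date_1 <= date_3)        # date condition as a filter

show_query(query)  # inspect the SQL that would be sent to the server
```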