There are two tables A and B.
They both contain titles referencing the same thing, but the naming conventions are always different and cannot be predicted.
The only way to match titles is to find low difference scores on a number of columns, but for now only the title is important.
There are only about 10,000 records in each table currently. The standard Cross Join and EditDistance combination works fine for now, but I've already noticed performance decreasing as the number of records grows.
Is there a more performant way of achieving the desired result of finding partial matches between strings in different tables?
I apologize if there is an obvious answer. The few posts that deviate from the EditDistance solution still assume a cross join: https://community.snowflake.com/s/question/0D50Z00008zPLLxSAO/join-with-partial-string-match
CodePudding user response:
You should use a blocking key strategy to help cut down on the number of pairs generated. This document explains this strategy and other techniques for Fuzzy Matching on Snowflake. https://drive.google.com/file/d/1FuxZnXojx71t-1kNOaqg1ErrEiiATdsM/view?usp=sharing
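To illustrate the blocking-key idea, here is a minimal sketch in Python. The block key used here (the first three alphanumeric characters of the lowercased title) is purely an assumption for demonstration; a real key would be chosen from the data, and a key this crude can miss matches whose titles start differently.

```python
from collections import defaultdict

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def block_key(title: str) -> str:
    # Hypothetical blocking key: first 3 alphanumeric chars, lowercased.
    cleaned = "".join(c for c in title.lower() if c.isalnum())
    return cleaned[:3]

def fuzzy_match(titles_a, titles_b, max_dist=2):
    """Compare only pairs sharing a block key, instead of the full cross join."""
    blocks = defaultdict(list)
    for t in titles_b:
        blocks[block_key(t)].append(t)
    matches = []
    for ta in titles_a:
        for tb in blocks.get(block_key(ta), []):
            if edit_distance(ta, tb) <= max_dist:
                matches.append((ta, tb))
    return matches
```

The same shape works in Snowflake SQL: join the two tables on the blocking-key expression rather than cross joining, then filter the surviving pairs with EDITDISTANCE.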
CodePudding user response:
As per Ryan's point, the way to avoid comparing all values is to prune which values get joined.
In another domain (spatial data) we found that quantizing the GPS coordinates and then joining against the 8 surrounding buckets, while it made for more comparisons between things a human could see were near, eliminated all the comparisons for things that are clearly very far away.
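A small sketch of that spatial analogue: quantize each coordinate onto a grid, then compare a point only against candidates in its own cell and the 8 neighbouring cells. The cell size here is an assumed tuning parameter, not a value from the original discussion.

```python
from collections import defaultdict
from math import floor, hypot

CELL = 0.01  # assumed grid cell size in degrees

def cell(lat, lon):
    """Quantize a coordinate down to its grid cell."""
    return (floor(lat / CELL), floor(lon / CELL))

def near_pairs(points_a, points_b, max_dist=0.005):
    """Find close pairs by checking only a cell and its 8 neighbours."""
    grid = defaultdict(list)
    for p in points_b:
        grid[cell(*p)].append(p)
    pairs = []
    for lat, lon in points_a:
        ci, cj = cell(lat, lon)
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                for blat, blon in grid.get((ci + di, cj + dj), []):
                    if hypot(lat - blat, lon - blon) <= max_dist:
                        pairs.append(((lat, lon), (blat, blon)))
    return pairs
```

Points in far-away cells are never even considered, which is exactly the pruning described above; the 3x3 neighbourhood is what guarantees a point near a cell boundary is still compared against its neighbours.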
Like most expensive computations, you want to prune as much as you can without missing things you want to include. Which is to say, false positives are fine, but false negatives are very bad.
So how you batch/bucket/prune your data is very specific to your application's data.