Home > Enterprise >  Is there a way to join two tables where the join condition joins on the row that most matches the va
Is there a way to join two tables where the join condition joins on the row that most matches the va

Time:10-27

I have two tables (main_table, codes_table). I need to join the two tables on column code. Example data below:

main_table

id customer code
1 Fred Acme Residential
2 Sue Acme Business
3 Bud Acme & Old State

codes_table

id code group
1 Acme X19CD
2 Acme Business G933C

My data is in Google Big Query and I am hoping to use native SQL. I am trying to come up with SQL that would allow me to join main_table to codes_table such that I would get the following output:

Results

cust_id customer code group
1 Fred Acme Residential X19CD
2 Sue Acme Business G933C
3 Bud Acme & Old State X19CD

Appreciate any thoughts on how one can accomplish this with SQL.

Essentially the code in the customer table could be various permutations of "Acme ". The issue I have had that trying the various ways I have to join to two tables ends up getting both code_table rows as they both start with "Acme". What I am trying to do is join the two where the main_table.code matches the most characters from the code_table.code.

CodePudding user response:

An option of course is the LIKE operator but to avoid the scenario you describe of multiple matches you'd also need to partition and filter the join condition (also known as windowing) in order to only join with the most LIKE match. I.E. include in your join a partition over code where partition_id = 1 to only grab the topmost value. Lot's of examples of windowing out there.

https://cloud.google.com/bigquery/docs/reference/standard-sql/window-function-calls#syntax

Another fun one to try might be to use the soundex() function which BigQuery does support. This creates a code for words that sound the same. You'd then join on the soundex() result instead of the actual codes. But my suspicion is that this won't be accurate enough and you'll get a bunch of false positives as well as potential multiple results which would require windowing as above.

https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#soundex

My true recommendation over either of those options would instead be either a data cleanup effort to standardize your codes. That's painful, but it's only going to get worse the longer you ignore it. OR, build a mapping table that essentially links records with the "version" they should join to. Then you'd always use that map table as a many-to-many within your join.

Lot's of workarounds, but only one true solution. Fix your data quality.

CodePudding user response:

Investigate the LIKE keyword. Investigate the use of the soundex() function in string matching. You may have to write your own function to do such matching. For example, if you want to match to strings on phonetic closeness, you can construct such a clause this way:

WHERE soundex(<fieldname>) = soundex(<input>)

I guess what I'm suggesting is:

WHERE mymatchingfunction(<fieldname>) = mymatchingfunction(<input>)

While this is computationally expensive, and you lose the benefit of indexing, it will deliver the desired results.

I hope this helps.

  • Related