Home > Net >  Hive SQL left join based on substring search from a second table
Hive SQL left join based on substring search from a second table

Time:10-20

I need to create a table that joins queries with a set of keywords that we are targeting within our content. The Google Search table is search queries and impressions to the blog and the keyword table is an internally created table of keywords and a value (1-10) we assign to them. I need to join these two in order to get the impressions based on keywords.

something similar to this dummy code

   select a.*, b.* 
     from google_search_table a
left join Keyword_Table b
       on contains(a.query, b.keyword)

Google Search Table

Query Impressions
Most minty toothpaste 8274
best bath soap 4375
How to cook tomatoes 1734
top softest toilet paper 892
top bushcraft skills 9284

Keyword Table

Keyword value
toothpaste 2
soap 5
tomatoes 3
toilet paper 3

Intended Result Table

Keyword Impressions
toothpaste 8274
soap 4375
tomatoes 1734
toilet paper 892
null 9284

CodePudding user response:

Here's the SQL to solve your problem.

SELECT * 
from ( select Keyword, Impressions
from google_search_table
lateral view explode(split(Query,' ')) t1 as Keyword ) a
where 
  a.Keyword is in (select Keyword from Keyword_Table)

If that performs reasonably, great use it. I'm assuming the keyword table is small.

If it blows the memory stack you may want to look at some type of broadcast join or map side join.

CodePudding user response:

I'd suggest trying pyspark/spark

resultTable = google_search_table.join(Keyword_Table, google_search_table.Query.contains(Keyword_Table.Keyword), how='left')

You can see how someone else solved a similar issue using pyspark here.

  • Related