I need to create a table that joins search queries with a set of keywords we are targeting in our content. The Google Search table contains search queries and their impressions on the blog; the keyword table is an internally created table of keywords and a value (1-10) we assign to each. I need to join these two tables to get impressions by keyword.
Something similar to this dummy code:
select a.*, b.*
from google_search_table a
left join Keyword_Table b
on contains(a.query, b.keyword)
Google Search Table
| Query | Impressions |
|---|---|
| Most minty toothpaste | 8274 |
| best bath soap | 4375 |
| How to cook tomatoes | 1734 |
| top softest toilet paper | 892 |
| top bushcraft skills | 9284 |
Keyword Table
| Keyword | value |
|---|---|
| toothpaste | 2 |
| soap | 5 |
| tomatoes | 3 |
| toilet paper | 3 |
Intended Result Table
| Keyword | Impressions |
|---|---|
| toothpaste | 8274 |
| soap | 4375 |
| tomatoes | 1734 |
| toilet paper | 892 |
| null | 9284 |
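For reference, the intended left join on a substring match can be sketched in SQLite, using LIKE as a stand-in for the contains() pseudocode (table and column names are taken from the question; the in-memory database is just for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Recreate the two tables from the question with the sample data.
cur.execute("CREATE TABLE google_search_table (Query TEXT, Impressions INTEGER)")
cur.execute("CREATE TABLE Keyword_Table (Keyword TEXT, value INTEGER)")
cur.executemany("INSERT INTO google_search_table VALUES (?, ?)", [
    ("Most minty toothpaste", 8274),
    ("best bath soap", 4375),
    ("How to cook tomatoes", 1734),
    ("top softest toilet paper", 892),
    ("top bushcraft skills", 9284),
])
cur.executemany("INSERT INTO Keyword_Table VALUES (?, ?)", [
    ("toothpaste", 2), ("soap", 5), ("tomatoes", 3), ("toilet paper", 3),
])

# Left join on "query contains keyword"; unmatched queries keep a NULL keyword.
rows = cur.execute("""
    SELECT b.Keyword, a.Impressions
    FROM google_search_table a
    LEFT JOIN Keyword_Table b
      ON a.Query LIKE '%' || b.Keyword || '%'
""").fetchall()
```

Because it is a substring match, the multi-word keyword "toilet paper" matches as well, and the unmatched "top bushcraft skills" row survives with a NULL keyword, as in the intended result table.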
CodePudding user response:
Here's Hive SQL that should get you most of the way:
SELECT a.Keyword, a.Impressions
FROM (
    SELECT Keyword, Impressions
    FROM google_search_table
    LATERAL VIEW explode(split(Query, ' ')) t1 AS Keyword
) a
WHERE a.Keyword IN (SELECT Keyword FROM Keyword_Table)
Two caveats: splitting on spaces only produces single-word tokens, so a multi-word keyword like "toilet paper" won't match, and the IN filter drops queries with no matching keyword, so you won't get the null row from your intended result.
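The explode/split logic can be sanity-checked in plain Python; the lists below are hypothetical in-memory stand-ins for the two Hive tables, using the sample data from the question:

```python
# In-memory stand-ins for the two tables.
google_search_table = [
    ("Most minty toothpaste", 8274),
    ("best bath soap", 4375),
    ("How to cook tomatoes", 1734),
    ("top softest toilet paper", 892),
    ("top bushcraft skills", 9284),
]
keyword_table = {"toothpaste", "soap", "tomatoes", "toilet paper"}

# Equivalent of: lateral view explode(split(Query, ' ')) -> one row per word.
exploded = [(word, imp)
            for query, imp in google_search_table
            for word in query.split(" ")]

# Equivalent of: WHERE Keyword IN (SELECT Keyword FROM Keyword_Table).
matched = [(kw, imp) for kw, imp in exploded if kw in keyword_table]

# "toilet paper" never appears as a single token, so the 892-impression
# row is lost, and the unmatched 9284-impression row is filtered out.
```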
If that performs reasonably, great, use it. I'm assuming the keyword table is small.
If it blows out memory, you may want to look at some type of broadcast join or map-side join.
CodePudding user response:
I'd suggest trying PySpark/Spark:
resultTable = google_search_table.join(
    Keyword_Table,
    google_search_table.Query.contains(Keyword_Table.Keyword),
    how='left'
)
You can see how someone else solved a similar issue using pyspark here.
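The semantics of that contains-based left join can be mirrored in plain Python (the lists are hypothetical stand-ins for the two DataFrames, populated with the question's sample rows):

```python
# Stand-ins for the two DataFrames.
google_search_table = [
    ("Most minty toothpaste", 8274),
    ("best bath soap", 4375),
    ("How to cook tomatoes", 1734),
    ("top softest toilet paper", 892),
    ("top bushcraft skills", 9284),
]
keyword_table = [("toothpaste", 2), ("soap", 5), ("tomatoes", 3), ("toilet paper", 3)]

result_table = []
for query, impressions in google_search_table:
    # Join condition: the query string contains the keyword.
    matches = [kw for kw, _ in keyword_table if kw in query]
    if matches:
        result_table.extend((kw, impressions) for kw in matches)
    else:
        # how='left': queries with no matching keyword are kept with a null key.
        result_table.append((None, impressions))
```

Because this is a substring match rather than a word split, the multi-word keyword "toilet paper" matches, and the unmatched query is preserved with a null keyword, matching the intended result table.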