I have a requirement where I need to JOIN a tweets table with person names, i.e. keep only the tweets that contain any person name. I have the following data:
Tweets Table: (70 million records stored as a HIVE Table)
id | tweet |
---|---|
1 | Cristiano Ronaldo greatest of all time |
2 | Brad Pitt movies |
3 | Random tweet without any person name |
Person Names: (1.6 million names stored on HDFS as .tsv file)
id | person_name |
---|---|
1 | Cristiano Ronaldo |
2 | Brad Pitt |
3 | Angelina Jolie |
Expected Result:
id | tweet | person_name |
---|---|---|
1 | Cristiano Ronaldo greatest of all time | Cristiano Ronaldo |
2 | Brad Pitt movies | Brad Pitt |
What I've tried so far:
I have converted the person names .tsv file to HIVE table as well and then tried to join 2 tables with the following HIVE query:
SELECT * FROM tweets t INNER JOIN people p WHERE instr(t.tweet, p.person_name) > 0;
Tried it with some sample data and it works fine. But when I run it on the entire data set (70M tweets JOINed with 1.6M person names), it takes forever. It definitely doesn't look very efficient.
I wanted to try the JOIN with PIG as well (it is sometimes considered a little more efficient than a HIVE JOIN), where I could directly JOIN the person-names .tsv file with the tweets HIVE table, but I'm not sure how to JOIN based on a substring in PIG. Can you please suggest any alternatives I could use?
CodePudding user response:
The idea is to create buckets so that we don't have to compare so many records. We deliberately increase the number of records/joins so that multiple nodes share the work, instead of one large cross join filtered by `WHERE instr(t.tweet, p.person_name) > 0`.
- Split the tweets into individual words. Yes, this multiplies your record count way up.
- Filter out stopwords, or some other list of common words that fits in memory.
- Split the names into first name and last name.
- Join tweets and names on the last name, and keep `instr(t.tweet, p.person_name) > 0` only as the final filter.
This should significantly reduce the amount of data you compare via the function, so it will run faster.
If you are going to do this regularly, consider creating the tables sorted/bucketed on the join key to really make things sizzle (i.e. make the join eligible to run as a Sort-Merge-Bucket join).
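The steps above can be sketched in HiveQL roughly like this; the intermediate names (`tweet_words`, `names_split`, `last_name`) are illustrative, not from the question, and the whitespace split is a simplification (real tweets need punctuation handling):

```sql
WITH tweet_words AS (
  -- one row per (tweet, word): this is the record-count blowup
  SELECT t.id, t.tweet, w.word
  FROM tweets t
  LATERAL VIEW explode(split(t.tweet, '\\s+')) w AS word
),
names_split AS (
  -- last token of the person name acts as the "bucket" key
  SELECT p.id, p.person_name,
         split(p.person_name, ' ')[size(split(p.person_name, ' ')) - 1] AS last_name
  FROM people p
)
SELECT DISTINCT tw.id, tw.tweet, n.person_name
FROM tweet_words tw
JOIN names_split n
  ON tw.word = n.last_name                    -- cheap equi-join on last name
WHERE instr(tw.tweet, n.person_name) > 0;     -- exact full-name check, now on far fewer pairs
```

The equi-join on `last_name` lets Hive shuffle and partition the work across reducers, so the expensive `instr()` check only runs on candidate pairs instead of all 70M × 1.6M combinations.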
CodePudding user response:
It is worth trying a Map Join. The person table is the small one, and a join with it can be converted to a Map-Join operator if the table fits into memory: the table is loaded into each mapper's memory.
Check the EXPLAIN output. If it shows a Common Join operator on a Reducer vertex, try increasing the mapper container memory and adjusting the map-join settings so the join converts to a Map Join.
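For example, run EXPLAIN on the query itself (a sketch; the comma join is equivalent to the cross-join form in the question):

```sql
EXPLAIN
SELECT *
FROM tweets t, people p
WHERE instr(t.tweet, p.person_name) > 0;
-- In the plan, look for "Map Join Operator" (good) versus a plain
-- "Join Operator" on a Reducer vertex (the slow Common Join path).
```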
Settings responsible for the Map Join (assuming the people table is < 2.5 GB): try bumping the map-join table size threshold to 2.5 GB (check the actual size) and run EXPLAIN again.
set hive.auto.convert.join=true; --this enables map-join
set hive.auto.convert.join.noconditionaltask = true;
set hive.mapjoin.smalltable.filesize=2500000000; --size of table to fit in memory
set hive.auto.convert.join.noconditionaltask.size=2500000000;
Also container size should be increased to avoid OOM (if you are on Tez):
set hive.tez.container.size=8192; --container size in megabytes
set hive.tez.java.opts=-Xmx6144m; --set this to ~80% of hive.tez.container.size
The figures are just an example. Adjust them and check the EXPLAIN output again; once it shows a Map-Join operator, rerun the query and it should be much faster.