Home > Mobile >  Using join to find similarities between two datasets containing strings in PySpark
Using join to find similarities between two datasets containing strings in PySpark

Time:06-10

I'm trying to match text records in two datasets, mostly via using PySpark (not using libraries such as BM25 or NLP techniques as much as I can for now -using Spark ML and SparkNLP libraries are fine).

I'm towards finishing the pre-processing phase. I've cleaned the text in both datasets, tokenized it and created bi-Grams (stored in a column called biGrams in both datasets).

Also, below is the schema of the two datasets. Please note the different IDs.

df1 columns:
int_id
fullText
biGrams

df2 columns:
ext_id
fullText
biGrams

My plan is to outer join the two datasets, explode on the biGrams, and then return the similar ones using something like inner join.

Now, I want to join (outer join) the two datasets using biGram columns with the code below:

full_similarity_df = df1.join(df2, on=[df1.biGrams == df2.biGrams], how = 'outer')

But I get a long error message with tracebacks, with the following error as the main error:

Found duplicate column(s) when inserting into file

Why is that, and how can I rectify this?

Please let me know if you can think of any other suggestions for tackling this problem.

Thanks in advance, Rfai

CodePudding user response:

The reason is when you display full_similarity_df you will see 2 fullText and biGrams columns like below

 ------ ----------- ------- ------ -------- ------- 
|int_id|   fulltext|bigrams|ext_id|fulltext|bigrams|
 ------ ----------- ------- ------ -------- ------- 
|     1|abc def fhg|abc def|     1| abc def|abc def|
|     2|abc def fhg|abc fhg|  null|    null|   null|
 ------ ----------- ------- ------ -------- ------- 

so if you give and alias to them then you won't get the duplicate column name issue

full_similarity_df = df1.join(df2, on=[df1.bigrams == df2.bigrams], how = 'outer').select("int_id",df1.fulltext.alias("df1_fulltext"),df1.bigrams.alias("df1_bigrams"),"ext_id",df2.fulltext.alias("df2_fulltext"),df2.bigrams.alias("df2_bigrams"))
full_similarity_df.show()
 ------ ------------ ----------- ------ ------------ ----------- 
|int_id|df1_fulltext|df1_bigrams|ext_id|df2_fulltext|df2_bigrams|
 ------ ------------ ----------- ------ ------------ ----------- 
|     1| abc def fhg|    abc def|     1|     abc def|    abc def|
|     2| abc def fhg|    abc fhg|  null|        null|       null|
 ------ ------------ ----------- ------ ------------ ----------- 
  • Related