I'm trying to match text records in two datasets, mostly via using PySpark (not using libraries such as BM25 or NLP techniques as much as I can for now -using Spark ML and SparkNLP libraries are fine).
I'm towards finishing the pre-processing phase. I've cleaned the text in both datasets, tokenized it and created bi-Grams (stored in a column called biGrams in both datasets).
Also, below is the schema of the two datasets. Please note the different IDs.
df1 columns:
int_id
fullText
biGrams
df2 columns:
ext_id
fullText
biGrams
My plan is to outer join the two datasets, explode on the biGrams, and then return the similar ones using something like inner join.
Now, I want to join (outer join) the two datasets using biGram columns with the code below:
full_similarity_df = df1.join(df2, on=[df1.biGrams == df2.biGrams], how = 'outer')
But I get a long error message with tracebacks, with the following error as the main error:
Found duplicate column(s) when inserting into file
Why is that, and how can I rectify this?
Please let me know if you can think of any other suggestions for tackling this problem.
Thanks in advance, Rfai
CodePudding user response:
The reason is when you display full_similarity_df you will see 2 fullText and biGrams columns like below
------ ----------- ------- ------ -------- -------
|int_id| fulltext|bigrams|ext_id|fulltext|bigrams|
------ ----------- ------- ------ -------- -------
| 1|abc def fhg|abc def| 1| abc def|abc def|
| 2|abc def fhg|abc fhg| null| null| null|
------ ----------- ------- ------ -------- -------
so if you give and alias to them then you won't get the duplicate column name issue
full_similarity_df = df1.join(df2, on=[df1.bigrams == df2.bigrams], how = 'outer').select("int_id",df1.fulltext.alias("df1_fulltext"),df1.bigrams.alias("df1_bigrams"),"ext_id",df2.fulltext.alias("df2_fulltext"),df2.bigrams.alias("df2_bigrams"))
full_similarity_df.show()
------ ------------ ----------- ------ ------------ -----------
|int_id|df1_fulltext|df1_bigrams|ext_id|df2_fulltext|df2_bigrams|
------ ------------ ----------- ------ ------------ -----------
| 1| abc def fhg| abc def| 1| abc def| abc def|
| 2| abc def fhg| abc fhg| null| null| null|
------ ------------ ----------- ------ ------------ -----------