How to compute cosine similarity between two texts in presto?-CodePudding

Hello everyone: I wanted to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take the texts as the inputs; it takes maps instead. I am not sure how to convert the texts into those maps in presto. I want the following, if we have a table like this:

id	text1	text2
1	a b b	b c

Then we can compute the cosine similarity as:

COSINE_SIMILARITY(
    MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
    MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)

i.e., two texts combined has three words: 'a', 'b', and 'c'; text1 has 1 count of 'a', 2 counts of 'b', and 0 count of 'c', which goes as the first MAP; similarly, text2 has 0 count of 'a', 1 count of 'b', and 1 count of 'c', which goes as the second MAP.

The final table should look like this:

id	text1	text2	all_unique_words	map1	map2	similarity
1	a b b	b c	[a b c]	[1, 2, 0]	[0, 1, 1]	0.63

How can we convert two texts into two such maps in presto? Thanks in advance!

CodePudding user response：

Use split to transform string into array and then depended on Presto version either use unnest histogram trick or array_frequency:

-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))

-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2)
group by id;

Output:

id	_col1
1	0.6324555320336759