Home > Mobile >  Convert string into vector to calculate cosine similarity
Convert string into vector to calculate cosine similarity

Time:06-16

I have two PySpark dataframes of the following structure. I would like to perform cross join and calculate cosine similarity. The qry_emb is a string column with comma separated values.

How to convert this string into dense vector?

Pyspark dataframe

df.printSchema()
# root
# |-- query: string (nullable = true)
# |-- qry_emb: string (nullable = true)

CodePudding user response:

To convert string to vector, first convert your string to array (split), then use array_to_vector

from pyspark.sql import functions as F
from pyspark.ml.functions import array_to_vector
df = df.withColumn('qry_emb', array_to_vector(F.split('qry_emb', ',[ ]*').cast('array<double>')))
  • Related