I have two PySpark dataframes of the following structure. I would like to perform cross join and calculate cosine similarity. The qry_emb is a string column with comma separated values.
How to convert this string into dense vector?
df.printSchema()
# root
# |-- query: string (nullable = true)
# |-- qry_emb: string (nullable = true)
CodePudding user response:
To convert string to vector, first convert your string to array (split
), then use array_to_vector
from pyspark.sql import functions as F
from pyspark.ml.functions import array_to_vector
df = df.withColumn('qry_emb', array_to_vector(F.split('qry_emb', ',[ ]*').cast('array<double>')))