I'm trying to tokenize a string column from a Spark DataFrame.
The DataFrame schema is as follows:
df:
index ---> Integer
question ---> String
This is how I'm using the Spark Tokenizer:
Quest = df.withColumn("question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol=Quest, outputCol="question_parts")
But I get the following error:
Invalid param value given for param "inputCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type
I also substituted the first line of my code with the following alternatives, but they didn't resolve the error either:
Quest = df.select(concat_ws(" ",col("question")))
and
Quest= df.withColumn("question", concat_ws(" ",col("question")))
What's my mistake here?
CodePudding user response:
The mistake is in the second line: inputCol expects a column name (a string), not a DataFrame. Your first line assigns a DataFrame to Quest, because df.withColumn()
returns a new DataFrame with the column you just created appended, and you then pass that DataFrame as inputCol. Passing the column name instead, inputCol="question",
gives you what you need. You then need to transform your DataFrame by calling the tokenizer's transform() method (lowercase t), which returns a new DataFrame with the output column added. Try:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer

df = df.withColumn("question", col("question").cast(StringType()))
tokenizer = Tokenizer(inputCol="question", outputCol="question_parts")
df = tokenizer.transform(df)