I'm trying to tokenize a string column from a Spark DataFrame.
The DataFrame schema is as follows:
df:
index ---> Integer
question ---> String
This is how I'm using the Spark Tokenizer:
Quest = df.withColumn("question", col("Question").cast(StringType()))
tokenizer = Tokenizer(inputCol=Quest, outputCol="question_parts")
But I get the following error:
Invalid param value given for param "inputCol". Could not convert <class 'pyspark.sql.dataframe.DataFrame'> to string type
I also substituted the first line of my code with the following alternatives, but they didn't resolve the error either:
Quest = df.select(concat_ws(" ",col("question")))
and
Quest= df.withColumn("question", concat_ws(" ",col("question")))
What's my mistake here?
CodePudding user response:
The mistake is in the second line: inputCol expects a column name (a string), not a DataFrame. Your first line assigns a DataFrame to Quest, because df.withColumn()
returns a new DataFrame with the column you just created appended, and you then pass that DataFrame as inputCol. Passing the column name instead, inputCol="question",
gives you what you need. You then need to transform your DataFrame by calling the tokenizer's transform() method (lowercase t), which returns a new DataFrame with the output column added. Try:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
from pyspark.ml.feature import Tokenizer

df = df.withColumn("question", col("question").cast(StringType()))
tokenizer = Tokenizer(inputCol="question", outputCol="question_parts")
df = tokenizer.transform(df)