How to merge a list of words in a PySpark dataframe?


I have a dataframe whose words column holds a list of words per row, and I need to merge each list into a single sentence.


temp = spark.createDataFrame([
    (0, ['Julia', 'is', 'awesome']),
    (2, ['Data-science', 'is','cool']),
    (3, ['Machine','learning'])
], ["id", "words"])

# +---+------------------------+
# |id |words                   |
# +---+------------------------+
# |0  |[Julia, is, awesome]    |
# |2  |[Data-science, is, cool]|
# |3  |[Machine, learning]     |
# +---+------------------------+

# root
#  |-- id: long (nullable = true)
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)

I tried mapping over the underlying RDD and joining the words with a space:

rdd_df = temp.rdd.map(lambda x: [x['id'], ' '.join(x['words'])])
spark.createDataFrame(rdd_df, temp.schema).show(10, False)

# +---+---------------------------------------------------------+
# |id |words                                                    |
# +---+---------------------------------------------------------+
# |0  |[ ' J u l i a ' ,   ' i s ' ,   ' a w e s o m e ' ]      |
# |2  |[ ' D a t a - s c i e n c e ' ,   ' i s ' , ' c o o l ' ]|
# |3  |[ ' M a c h i n e ' , ' l e a r n i n g ' ]              |
# +---+---------------------------------------------------------+

However, this code does not return the desired output. Is there a solution that does not go through the RDD?

Desired output:

+---+--------------------+
|id |words               |
+---+--------------------+
|0  |Julia is awesome    |
|2  |Data-science is cool|
|3  |Machine learning    |
+---+--------------------+

CodePudding user response:

If you have a list of words (an array of strings), you can combine them using array_join:

from pyspark.sql import functions as F
temp = spark.createDataFrame([
    (0, ['Julia', 'is', 'awesome']),
    (1, ['Data-science', 'is','cool']),
    (2, ['Machine','learning'])
], ["id", "words"])

# Join the array elements into one string, separated by a single space
temp = temp.withColumn('words', F.array_join('words', ' '))
temp.show()

# +---+--------------------+
# | id|               words|
# +---+--------------------+
# |  0|    Julia is awesome|
# |  1|Data-science is cool|
# |  2|    Machine learning|
# +---+--------------------+