I have a DataFrame with a column that holds a list of words in each row, and I need to merge each list into a single sentence.
Dataframe:
temp = spark.createDataFrame([
    (0, ['Julia', 'is', 'awesome']),
    (2, ['Data-science', 'is', 'cool']),
    (3, ['Machine', 'learning'])
], ["id", "words"])
temp.show(truncate=False)
# +---+------------------------+
# |id |words                   |
# +---+------------------------+
# |0  |[Julia, is, awesome]    |
# |2  |[Data-science, is, cool]|
# |3  |[Machine, learning]     |
# +---+------------------------+
temp.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- words: array (nullable = true)
#  |    |-- element: string (containsNull = true)
I tried mapping over the underlying RDD and joining the words in each row:
rdd_df = temp.rdd.map(lambda x: [x['id'], ' '.join(x['words'])])
spark.createDataFrame(rdd_df, temp.schema).show(10, False)
# +---+-------------------------------------------------------+
# |id |words                                                  |
# +---+-------------------------------------------------------+
# |0  |[ ' J u l i a ' , ' i s ' , ' a w e s o m e ' ]        |
# |2  |[ ' D a t a - s c i e n c e ' , ' i s ' , ' c o o l ' ]|
# |3  |[ ' M a c h i n e ' , ' l e a r n i n g ' ]            |
# +---+-------------------------------------------------------+
But this does not return the desired output. Is there a way to do this without using an RDD?
Desired output:
+---+--------------------+
|id |words               |
+---+--------------------+
|0  |Julia is awesome    |
|2  |Data-science is cool|
|3  |Machine learning    |
+---+--------------------+
CodePudding user response:
If you have a list of words (an array of strings), you can combine them using array_join (available since Spark 2.4):
from pyspark.sql import functions as F
temp = spark.createDataFrame([
    (0, ['Julia', 'is', 'awesome']),
    (1, ['Data-science', 'is', 'cool']),
    (2, ['Machine', 'learning'])
], ["id", "words"])
temp = temp.withColumn('words', F.array_join('words', ' '))
temp.show()
# +---+--------------------+
# | id|               words|
# +---+--------------------+
# |  0|    Julia is awesome|
# |  1|Data-science is cool|
# |  2|    Machine learning|
# +---+--------------------+