I have a dataframe as follows:
data = [
    ("100", "the boy wants go to school"),
    ("200", "he is a good boy"),
    ("300", "he likes to play football in the school"),
]
schema = ["id", "description"]
df = spark.createDataFrame(data, schema=schema)
I want to create a standalone dictionary of word counts from the 'description' column (i.e. not as a new column; I need to access the dictionary separately later).
Desired output, i.e. the dictionary, should be:
the: 2
boy: 2
wants: 1
he: 2
school: 2
play: 1
...
I know how to do this using pandas (a rough sketch of that approach is at the end of this question). How can I do it using PySpark? I've tried MapType, a udf, etc., but could not succeed.
Thanks in advance!
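For reference, the pandas approach I have in mind is roughly this (a minimal sketch; collections.Counter is just one convenient way to tally the words):
from collections import Counter
import pandas as pd

# Reusing the same sample data as above
pdf = pd.DataFrame(data, columns=["id", "description"])
# Split every description on whitespace and tally the words across all rows
res = dict(Counter(word for desc in pdf["description"] for word in desc.split()))
print(res)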
CodePudding user response:
The following should do it:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
    ("100", "the boy wants go to school"),
    ("200", "he is a good boy"),
    ("300", "he likes to play football in the school"),
]
schema = ["id", "description"]
df = spark.createDataFrame(data, schema=schema)

# Split each description on spaces, explode so every word gets its own row,
# then count the occurrences of each word.
df = (
    df.withColumn("word", f.explode(f.split(f.col("description"), " ")))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)  # optional: most frequent words first
)
# Collect the counts to the driver as a list of dicts, then build the final dict.
res = df.rdd.map(lambda row: row.asDict()).collect()
res = {d["word"]: d["count"] for d in res}
# (Equivalently: res = {row["word"]: row["count"] for row in df.collect()})
print(res)
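With the sample data this should print something like the following (the ordering among words with equal counts may vary):
{'the': 2, 'boy': 2, 'he': 2, 'to': 2, 'school': 2, 'wants': 1, 'go': 1, 'is': 1, 'a': 1, 'good': 1, 'likes': 1, 'play': 1, 'football': 1, 'in': 1}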
Another way to export the dict is to convert your Spark DataFrame to pandas with toPandas, as demonstrated here.
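A minimal sketch of that route, assuming the aggregated df from above (note that toPandas, like collect, pulls the whole result to the driver, so it only suits small results):
pdf = df.toPandas()
res = dict(zip(pdf["word"], pdf["count"]))
print(res)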