PySpark create external dictionary from a dataframe column of strings


I have a dataframe as follows:

data = [
    ("100", 'the boy wants go to school'),
    ("200", 'he is a good boy'),
    ("300", 'he likes to play football in the school')
]
schema = ['id', 'description']
df = spark.createDataFrame(data, schema=schema)

I want to create an external dictionary of word counts (i.e. not as a new column; I need to access the dictionary separately later) from the words in every row of the 'description' column.

Desired output, i.e. my dictionary, should be:

the: 2
boy: 2
wants: 1
he: 2
school: 2
play: 1
...

I know how to do this using pandas. How can I do it using PySpark?

(I've tried MapType, udf, etc., but could not succeed.)

Thanks in advance!

CodePudding user response:

The following should do it:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    ("100", "the boy wants go to school"),
    ("200", "he is a good boy"),
    ("300", "he likes to play football in the school"),
]
schema = ["id", "description"]
df = spark.createDataFrame(data, schema=schema)

# Split each description on spaces, explode to one row per word, then count.
df = (
    df.withColumn("word", f.explode(f.split(f.col("description"), " ")))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

# Collect the word counts to the driver and build a plain Python dict.
res = df.rdd.map(lambda row: row.asDict()).collect()
res = {d["word"]: d["count"] for d in res}

print(res)
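
On the sample data this should print a plain Python dict along the lines of the following (keys follow the descending count sort, but ties may come out in any order):

{'the': 2, 'boy': 2, 'he': 2, 'to': 2, 'school': 2, 'wants': 1, 'go': 1, 'is': 1, 'a': 1, 'good': 1, 'likes': 1, 'play': 1, 'football': 1, 'in': 1}

Note that collect() pulls all the word counts to the driver, which is fine for a small vocabulary but worth keeping in mind for large corpora.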

Another way to export the dict is to convert your Spark DataFrame to pandas with toPandas() and build the dictionary from there.
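
For example, here is a minimal sketch of that route, reusing the aggregated df from above (assumes pandas is installed on the driver):

# Bring the small aggregated word-count table to the driver as a pandas DataFrame.
pdf = df.toPandas()
res = dict(zip(pdf["word"], pdf["count"]))
print(res)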
