PySpark: the most frequent words


I have a dataframe with a language column and a word column:

sdf = spark.createDataFrame([
    ('eng', "cat"),
    ('eng', 'cat'),
    ('eng','dog'),
    ('eng','cat')
], ["lang", "text"])
sdf.show()

+----+----+
|lang|text|
+----+----+
| eng| cat|
| eng| cat|
| eng| dog|
| eng| cat|
+----+----+

I want to get the word counts per language, grouped into a map. Expected output:

lang    count_words
eng     {'cat': 3, 'dog': 1}

I do this:

from pyspark.sql.functions import count

grouped = sdf.groupBy("lang").agg(count("text").alias("count_words"))
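
This, however, only gives me the total number of words per language, something like:

grouped.show()

+----+-----------+
|lang|count_words|
+----+-----------+
| eng|          4|
+----+-----------+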

I understand that I need to use create_map() here, but I don't quite understand how to apply it.

CodePudding user response:

Group by both "lang" and "text", and then use create_map() as you intended. Here it is:

from pyspark.sql.functions import create_map, count

grouped = sdf.groupBy("lang", "text").agg(
    create_map("text", count("text")).alias("count_words")
)
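
That returns one row per (lang, text) pair, each with a single-entry map. If you want a single map per language, as in your expected output, one way (a sketch, assuming Spark 2.4+ for map_from_entries) is to count the pairs first and then collect them into a map:

from pyspark.sql.functions import count, collect_list, struct, map_from_entries

# count each (lang, text) pair, then fold the pairs into one map per language
word_counts = sdf.groupBy("lang", "text").agg(count("text").alias("n"))
per_lang = word_counts.groupBy("lang").agg(
    map_from_entries(collect_list(struct("text", "n"))).alias("count_words")
)
per_lang.show(truncate=False)
# roughly: |eng |{cat -> 3, dog -> 1}|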
  