I have a dataframe with two columns, language and word:
sdf = spark.createDataFrame([
    ('eng', 'cat'),
    ('eng', 'cat'),
    ('eng', 'dog'),
    ('eng', 'cat')
], ["lang", "text"])
sdf.show()
+----+----+
|lang|text|
+----+----+
| eng| cat|
| eng| cat|
| eng| dog|
| eng| cat|
+----+----+
I want to group the words by language and count them. What I expect:

lang    count_words
eng     {'cat': 3, 'dog': 1}
I do this:
from pyspark.sql.functions import count

grouped = sdf.groupBy("lang").agg(count('text').alias("count_words"))

But I understand that I need to use create_map() here; I just don't quite understand how.
CodePudding user response:
Group by both "lang" and "text", then use create_map() as you intended:
from pyspark.sql.functions import create_map, count

grouped = sdf.groupBy("lang", "text").agg(
    create_map('text', count('text')).alias('count_words')
)
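Note that this yields one row per (lang, text) pair, each carrying a single-entry map, something like:

+----+----+-----------+
|lang|text|count_words|
+----+----+-----------+
| eng| cat| {cat -> 3}|
| eng| dog| {dog -> 1}|
+----+----+-----------+

If you want exactly one row per language, as in your expected output, here is a sketch (assuming Spark 2.4+ for map_from_entries): count per word first, then collect the (word, count) pairs into a single map per language:

from pyspark.sql.functions import count, collect_list, struct, map_from_entries

grouped = (
    sdf.groupBy("lang", "text")
       .agg(count("text").alias("n"))    # count occurrences of each word per language
       .groupBy("lang")
       .agg(map_from_entries(            # fold the (word, count) pairs into one map column
           collect_list(struct("text", "n"))
       ).alias("count_words"))
)

grouped.show() should then print a single row for eng with a map along the lines of {cat -> 3, dog -> 1} (entry order may vary).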