I have the following dataframe.
docid_  province
123     zhejiang
123     zhejiang
123     shanghai
456     zhejiang
I want to find the most frequent province for each docid_. I first group by docid_ and then count the frequency of each province, but I get the error: 'Column' object is not callable
This is my code:
uin_feature_province_count = uin_feature.groupBy("docid_").\
agg(col("province").groupBy("province").count().orderBy(col("province").desc).collect()(0).get(0).alias("most_province"))
CodePudding user response:
I haven't tried to fix your code, but if you just need the most common province for each docid_, you can use row_number over the count of province, as done below:
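from pyspark.sql import Window
from pyspark.sql.functions import col, count, row_number

# Rank provinces within each docid_ by how often they occur, then keep the top one.
w = Window.partitionBy("docid_").orderBy(count("province").desc())
uin_feature_province_count = (
    uin_feature.groupBy("docid_", "province")
    .agg(row_number().over(w).alias("rank"))
    .filter(col("rank") == 1)
    .select("docid_", "province")
)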
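As for the error itself: in PySpark, accessing an unknown attribute on a Column (such as col("province").groupBy) returns another Column rather than a method, so calling it raises 'Column' object is not callable. Grouping and counting have to happen on the DataFrame, not inside agg. If it helps to see the steps separately, here is a minimal sketch that first counts each (docid_, province) pair and then ranks within each docid_. The function name most_frequent_province, the cnt column, the integer docid_ values, the local SparkSession, and the alphabetical tie-break on province are all my assumptions, not something from your code.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F


def most_frequent_province(df):
    """Return one row per docid_ with its most frequent province (hypothetical helper)."""
    # Step 1: count how often each (docid_, province) pair occurs.
    counts = df.groupBy("docid_", "province").agg(F.count("*").alias("cnt"))
    # Step 2: rank provinces within each docid_ by descending count.
    # Assumption: ties are broken alphabetically by province so the result is deterministic.
    w = Window.partitionBy("docid_").orderBy(F.col("cnt").desc(), F.col("province"))
    return (
        counts.withColumn("rank", F.row_number().over(w))
        .filter(F.col("rank") == 1)
        .select("docid_", F.col("province").alias("most_province"))
    )


# Sample data copied from the question; a local session is assumed for illustration.
spark = SparkSession.builder.master("local[1]").appName("most-province").getOrCreate()
uin_feature = spark.createDataFrame(
    [(123, "zhejiang"), (123, "zhejiang"), (123, "shanghai"), (456, "zhejiang")],
    ["docid_", "province"],
)
result = {
    row["docid_"]: row["most_province"]
    for row in most_frequent_province(uin_feature).collect()
}
print(result)
```

Without the secondary sort, row_number picks an arbitrary province when two are tied for a docid_, so drop or change that tie-break depending on what you want in that case.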