I have a DataFrame (docwords) that looks like this:
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
|    3|      3|  600|
|    2|      3|  702|
|    1|      2|  120|
|    2|      5|  200|
|    2|      2|  500|
|    3|      1|  100|
|    3|      5| 2000|
|    3|      4|  122|
|    1|      3| 1200|
|    1|      1| 1000|
+-----+-------+-----+
For each vocabId, I want to output the maximum count and the docId that count belongs to. I tried this:
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))
and got this:
+-------+-----+
|vocabId|count|
+-------+-----+
|      1| 1000|
|      3| 1200|
|      5| 2000|
|      4|  122|
|      2|  500|
+-------+-----+
How do I add the docId column at the front? The result should look something like this (row order is not important):
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
|    2|      2|  500|
|    3|      5| 2000|
|    3|      4|  122|
|    1|      3| 1200|
|    1|      1| 1000|
+-----+-------+-----+
CodePudding user response:
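You can join the aggregated result back to docwords on vocabId and count, which brings the matching docId along. If several documents tie for a vocabId's maximum count, all of them are kept. Something like this: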
val wordCounts = docwords
  .groupBy("vocabId")
  .agg(max($"count").as("count"))
  .join(docwords, Seq("vocabId", "count"))
  .select("docId", "vocabId", "count") // the join puts the key columns first, so reorder
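If you would rather avoid the self join, a window function is another option that should give the same rows. Below is a minimal sketch, not tested against your real data: the object name MaxCountPerVocab, the helper topPerVocab, and the local SparkSession settings are my own placeholders, and the sample rows are copied from the question. Unlike the join, row_number keeps exactly one row per vocabId, so on a tie it picks one of the tied documents arbitrarily.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object MaxCountPerVocab {

  // Hypothetical helper: keeps, for each vocabId, the single row with the highest count.
  def topPerVocab(docwords: DataFrame): DataFrame = {
    // Rank rows inside each vocabId group, highest count first.
    val byVocab = Window.partitionBy("vocabId").orderBy(col("count").desc)

    docwords
      .withColumn("rn", row_number().over(byVocab))
      .where(col("rn") === 1) // only the top-ranked row per vocabId
      .drop("rn")             // leaves docId, vocabId, count in the original order
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session just for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-count-per-vocab")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val docwords = Seq(
      (3, 3, 600), (2, 3, 702), (1, 2, 120), (2, 5, 200), (2, 2, 500),
      (3, 1, 100), (3, 5, 2000), (3, 4, 122), (1, 3, 1200), (1, 1, 1000)
    ).toDF("docId", "vocabId", "count")

    topPerVocab(docwords).show()

    spark.stop()
  }
}
```

Because the window version never drops docId, no reordering or second pass over docwords is needed. If you want to keep every tied row instead, swapping row_number for rank should do it.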