I have a dataframe look like this below
id pub_date version unique_id c_id p_id type source
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 1 64WP-UI-POLI 002 P02 org internet
lni001 20220301 2 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 2 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 3 64WP-UI-CFGT 012 K21 location internet
lni001 20220301 3 64WP-UI-POLI 002 P02 org internet
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 5 64WP-UI-CFGT 012 K21 location internet
lni002 20220301 1 64WP-UI-CFGT 012 K21 location internet
::
::
I want to groupby id column and only keep the highest number from version column but here is a catch, I also need to take into consideration for the type column (which only have two types, org or location). The final dataframe will look like this below
id pub_date version unique_id c_id p_id type source
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 85 64WP-UI-POLI 002 P02 org internet
lni001 20220301 5 64WP-UI-CFGT 012 K21 location internet
lni002 20220301 14 64WP-UI-CFGT 012 K21 location internet
::
::
My current approach is separate the dataframe into two different ones, first one is org under type column, the other one is location under type column.Then I am using groupby, withColumn
but my dataframe is huge. And I am wondering are there more efficient ways to do this maybe in one line of code? Rather than need to separate them into two dataframe then merger them back together?
Thanks!
CodePudding user response:
dense_rank()
can be used to find out top versions based on id & type. This can be used to retain only the top record in each group.
input.withColumn("rank", dense_rank() over (Window.partitionBy($"id",$"type").orderBy($"version".desc)))
.filter($"rank" === 1)
.drop($"rank")
Output:
------ -------- ------- ------------ --- ---- -------- --------
|id |pub_date|version|unique_id |_id|p_id|type |source |
------ -------- ------- ------------ --- ---- -------- --------
|lni001|20220301|5 |64WP-UI-CFGT|012|K21 |location|internet|
|lni001|20220301|85 |64WP-UI-POLI|002|P02 |org |internet|
|lni001|20220301|85 |64WP-UI-POLI|002|P02 |org |internet|
|lni002|20220301|1 |64WP-UI-CFGT|012|K21 |location|internet|
------ -------- ------- ------------ --- ---- -------- --------