Better/Efficient way to filter out Spark Dataframe rows with multiple conditions


I have a dataframe that looks like this:

  id          pub_date   version         unique_id     c_id    p_id    type      source
lni001        20220301      1           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      1           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      1           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      2           64WP-UI-CFGT    012     K21   location  internet
lni001        20220301      2           64WP-UI-CFGT    012     K21   location  internet
lni001        20220301      3           64WP-UI-CFGT    012     K21   location  internet
lni001        20220301      3           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      5           64WP-UI-CFGT    012     K21   location  internet
lni002        20220301      1           64WP-UI-CFGT    012     K21   location  internet
 ::
 ::

I want to group by the id column and keep only the rows with the highest value in the version column, but there is a catch: I also need to take the type column into account (it has only two values, org and location). The final dataframe should look like this:

  id          pub_date   version         unique_id     c_id    p_id    type      source
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      5           64WP-UI-CFGT    012     K21   location  internet
lni002        20220301      14          64WP-UI-CFGT    012     K21   location  internet
 ::
 ::

My current approach is to split the dataframe into two, one where type is org and one where type is location, then use groupBy and withColumn on each, something like the sketch below. But my dataframe is huge, and I am wondering whether there is a more efficient way to do this, maybe in one line of code, rather than splitting into two dataframes and merging them back together.
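
For reference, this is roughly what the split/merge approach looks like (a sketch only; the dataframe name df and the max-version aggregation are assumptions, since the question doesn't show the actual code):

    import org.apache.spark.sql.functions.max

    // split by type, keep the max version per id in each half, then merge
    val orgDf = df.filter($"type" === "org")
    val locDf = df.filter($"type" === "location")

    val orgTop = orgDf.join(orgDf.groupBy("id").agg(max("version").as("version")), Seq("id", "version"))
    val locTop = locDf.join(locDf.groupBy("id").agg(max("version").as("version")), Seq("id", "version"))

    val result = orgTop.unionByName(locTop)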

Thanks!

CodePudding user response:

dense_rank() over a window partitioned by id and type can rank the versions within each group; filtering on rank 1 then retains only the top records. Note that dense_rank() assigns the same rank to ties, which is why both version-85 rows survive below; use row_number() instead if you want exactly one row per group.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

input.withColumn("rank", dense_rank() over (Window.partitionBy($"id", $"type").orderBy($"version".desc)))
  .filter($"rank" === 1)
  .drop($"rank")

Output:

+------+--------+-------+------------+----+----+--------+--------+
|id    |pub_date|version|unique_id   |c_id|p_id|type    |source  |
+------+--------+-------+------------+----+----+--------+--------+
|lni001|20220301|5      |64WP-UI-CFGT|012 |K21 |location|internet|
|lni001|20220301|85     |64WP-UI-POLI|002 |P02 |org     |internet|
|lni001|20220301|85     |64WP-UI-POLI|002 |P02 |org     |internet|
|lni002|20220301|1      |64WP-UI-CFGT|012 |K21 |location|internet|
+------+--------+-------+------------+----+----+--------+--------+
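
If the rank itself isn't needed, a windowed max works as well and avoids the ordering that ranking requires (a sketch under the same assumptions; ties behave like dense_rank, so duplicate top rows are kept):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.max

    // compute the max version per (id, type) without ranking, then keep matching rows
    val w = Window.partitionBy($"id", $"type")

    input
      .withColumn("max_version", max($"version").over(w))
      .filter($"version" === $"max_version")
      .drop("max_version")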