Group By multiple columns with conditions in Spark SQL


Can anyone shed some light on how I should tackle this problem?

Current data

Name  Code  Date        Count
A     1A    2020-05-03  34
A     1A    2020-04-02  25
B     3D    2021-04-23  24
C     2X    2021-04-01  01
C     2X    2021-03-31  01

Desired Output:

Name  Code  Date        Count
A     1A    2020-05-03  34
B     3D    2021-04-23  24
C     2X    2021-04-01  01
C     2X    2021-03-31  01

Output from my code:

Name  Code  Date        Count
A     1A    2020-05-03  34
B     3D    2021-04-23  24
C     2X    2021-04-01  01

Below is my code:

SELECT 
  name, 
  code, 
  MAX(date) AS dates, 
  MAX(Cases_Number) AS Max_Num 
FROM (
  SELECT 
    lhd_2010_name AS name, 
    lhd_2010_code AS code, 
    notification_date AS date, 
    FLOOR(SUM(num)) AS Cases_Number 
  FROM cases 
  GROUP BY 
    notification_date, 
    lhd_2010_name, 
    lhd_2010_code 
  ORDER BY Cases_Number DESC, notification_date, lhd_2010_name DESC
) AS innertable 
GROUP BY name, code 
ORDER BY Max_Num DESC

In the inner table I had to sum the counts (every raw count was 1), grouping by name, code, and date to get the totals per day. Then in the outer table I need the max count for each name/code combination. If several rows tie on that max count for the same name/code combination, all of them should be output.

I understand the row is missing because I used MAX(date), but this was the only way I found to group by name and code while still showing a date. If I group by name, code, and date, all the other rows show up again.
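To make the failure mode concrete, here is a minimal sketch (assuming a spark session is in scope) of what happens to the two tied C/2X rows: grouping by name and code yields exactly one row per group, so MAX(date) can only ever surface one of the tied dates.

import org.apache.spark.sql.functions.max
import spark.implicits._

// Just the two tied rows for C/2X from the data above.
val tied = Seq(
  ("C", "2X", "2021-04-01", 1),
  ("C", "2X", "2021-03-31", 1)
).toDF("name", "code", "date", "count")

// One row per (name, code) group: the 2021-03-31 row is gone.
tied.groupBy("name", "code")
  .agg(max("date").as("dates"), max("count").as("Max_Num"))
  .show()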

Thanks in advance.

CodePudding user response:
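To follow along, here is a minimal sketch that builds the question's sample data as a DataFrame named main (the spark session and the column types are assumptions; dates are kept as strings):

import spark.implicits._

// Sample rows from the question; "count" already holds the summed totals.
val main = Seq(
  ("A", "1A", "2020-05-03", 34),
  ("A", "1A", "2020-04-02", 25),
  ("B", "3D", "2021-04-23", 24),
  ("C", "2X", "2021-04-01", 1),
  ("C", "2X", "2021-03-31", 1)
).toDF("name", "code", "date", "count")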

Let's call your main table main. First, we group by name, code, and count to count duplicates, aliasing that count as countDup, and we keep only groups where countDup > 1. Essentially, we are after rows like this one:

|C   |2X  |1    |2021-04-01|

The code looks like this:

import org.apache.spark.sql.functions.{col, count}

// Count how many times each (name, code, count) triple occurs,
// then keep only the duplicated ones.
val ds2 = main.groupBy("name", "code", "count")
  .agg(count("*").alias("countDup"))
  .where(col("countDup").gt(1))

Preview of ds2:

+----+----+-----+--------+
|name|code|count|countDup|
+----+----+-----+--------+
|   C|  2X|    1|       2|
+----+----+-----+--------+

Then we left-join ds2 back onto the main table and add a window column with the maximum count per (name, code) partition. The filter then keeps a row if it belongs to a duplicated group (countDup is non-null after the left join) or if its count equals that per-partition maximum; this is what lets both tied C/2X rows through while still picking the single max row for A and B. The code:

import org.apache.spark.sql.functions.{col, expr}

main
  .join(ds2, Seq("name", "code", "count"), "left")
  // Maximum count per (name, code) partition, via a window function.
  .withColumn("ranking", expr("max(count) over (partition by name, code)"))
  // Keep duplicated rows, plus rows holding the per-partition maximum.
  .filter(col("countDup").isNotNull || col("count").equalTo(col("ranking")))
  .drop("countDup", "ranking")
  .orderBy("name")

Final output (ordered by name):

+----+----+-----+----------+
|name|code|count|date      |
+----+----+-----+----------+
|A   |1A  |34   |2020-05-03|
|B   |3D  |24   |2021-04-23|
|C   |2X  |1    |2021-04-01|
|C   |2X  |1    |2021-03-31|
+----+----+-----+----------+

I hope this is what you need!

SPARK SQL VERSION

First, we register the DataFrame as a temp view:

main.createTempView("main")

Then apply the following SQL:

SELECT name, code, date, count FROM (
    SELECT m.name, m.code, m.date, m.count, r.countDup,
           MAX(m.count) OVER (PARTITION BY m.name, m.code) AS ranking
    FROM main m LEFT JOIN (
        SELECT name, code, count, COUNT(*) AS countDup
        FROM main
        GROUP BY name, code, count
        HAVING COUNT(*) > 1) r
    ON m.name = r.name AND m.code = r.code AND m.count = r.count) t
WHERE countDup IS NOT NULL OR count = ranking
ORDER BY name
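To run this from Scala, assuming the statement above is stored in a string named query (the name is just for illustration):

// Execute the query against the "main" temp view and print all rows untruncated.
val result = spark.sql(query)
result.show(false)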

Result is the same as above!
