Group By multiple columns with conditions in Spark SQL


Can anyone shed some light on how I should tackle this problem?

Current data

Name  Code  Date        Count
A     1A    2020-05-03  34
A     1A    2020-04-02  25
B     3D    2021-04-23  24
C     2X    2021-04-01  01
C     2X    2021-03-31  01

Desired Output:

Name  Code  Date        Count
A     1A    2020-05-03  34
B     3D    2021-04-23  24
C     2X    2021-04-01  01
C     2X    2021-03-31  01

Output from my code:

Name  Code  Date        Count
A     1A    2020-05-03  34
B     3D    2021-04-23  24
C     2X    2021-04-01  01

Below is my code:

SELECT 
  name, 
  code, 
  MAX(date) AS dates, 
  MAX(Cases_Number) AS Max_Num 
FROM (
  SELECT 
    lhd_2010_name AS name, 
    lhd_2010_code AS code, 
    notification_date AS date, 
    FLOOR(SUM(num)) AS Cases_Number 
  FROM cases 
  GROUP BY 
    notification_date, 
    lhd_2010_name, 
    lhd_2010_code 
  ORDER BY Cases_Number DESC, notification_date, lhd_2010_name DESC
) AS innertable 
GROUP BY name, code 
ORDER BY Max_Num DESC

In the inner table I had to sum the counts (every raw count was 1), grouping by name, code, and date to get the totals per day. Then in the outer table I need the max count for each name/code combination. If several rows tie on that max count for the same name/code combination, all of them should be output.

I understand the row is missing because I used MAX(date), but this was the only way I found to group by name and code while still showing a date. If I group by name, code, and date, all the other rows show up again.
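To make the failure mode concrete, here is a minimal sketch (assuming a spark session is in scope) of what happens to the two tied C/2X rows: grouping by name and code yields exactly one row per group, so MAX(date) can only ever surface one of the tied dates.

import org.apache.spark.sql.functions.max
import spark.implicits._

// Just the two tied rows for C/2X from the data above.
val tied = Seq(
  ("C", "2X", "2021-04-01", 1),
  ("C", "2X", "2021-03-31", 1)
).toDF("name", "code", "date", "count")

// One row per (name, code) group: the 2021-03-31 row is gone.
tied.groupBy("name", "code")
  .agg(max("date").as("dates"), max("count").as("Max_Num"))
  .show()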

Thanks in advance.

CodePudding user response:
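To follow along, here is a minimal sketch that builds the question's sample data as a DataFrame named main (the spark session and the column types are assumptions; dates are kept as strings):

import spark.implicits._

// Sample rows from the question; "count" already holds the summed totals.
val main = Seq(
  ("A", "1A", "2020-05-03", 34),
  ("A", "1A", "2020-04-02", 25),
  ("B", "3D", "2021-04-23", 24),
  ("C", "2X", "2021-04-01", 1),
  ("C", "2X", "2021-03-31", 1)
).toDF("name", "code", "date", "count")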

Let's call your main table main. First, we group by name, code, and count to count duplicates, aliasing that count as countDup, and we keep only groups where countDup > 1. Essentially, we are after rows like this one:

|C   |2X  |1    |2021-04-01|

The code looks like this:

import org.apache.spark.sql.functions.{col, count}

// Count how many times each (name, code, count) triple occurs,
// then keep only the duplicated ones.
val ds2 = main.groupBy("name", "code", "count")
  .agg(count("*").alias("countDup"))
  .where(col("countDup").gt(1))

Preview of ds2:

+----+----+-----+--------+
|name|code|count|countDup|
+----+----+-----+--------+
|   C|  2X|    1|       2|
+----+----+-----+--------+

Then we left-join ds2 back onto the main table and add a window column with the maximum count per (name, code) partition. The filter then keeps a row if it belongs to a duplicated group (countDup is non-null after the left join) or if its count equals that per-partition maximum; this is what lets both tied C/2X rows through while still picking the single max row for A and B. The code:

import org.apache.spark.sql.functions.{col, expr}

main
  .join(ds2, Seq("name", "code", "count"), "left")
  // Maximum count per (name, code) partition, via a window function.
  .withColumn("ranking", expr("max(count) over (partition by name, code)"))
  // Keep duplicated rows, plus rows holding the per-partition maximum.
  .filter(col("countDup").isNotNull || col("count").equalTo(col("ranking")))
  .drop("countDup", "ranking")
  .orderBy("name")

Final output (ordered by name):

+----+----+-----+----------+
|name|code|count|date      |
+----+----+-----+----------+
|A   |1A  |34   |2020-05-03|
|B   |3D  |24   |2021-04-23|
|C   |2X  |1    |2021-04-01|
|C   |2X  |1    |2021-03-31|
+----+----+-----+----------+

I hope this is what you need!

SPARK SQL VERSION

First, we register the DataFrame as a temp view:

main.createTempView("main")

Then apply the following SQL:

SELECT name, code, date, count FROM (
    SELECT m.name, m.code, m.date, m.count, r.countDup,
           MAX(m.count) OVER (PARTITION BY m.name, m.code) AS ranking
    FROM main m LEFT JOIN (
        SELECT name, code, count, COUNT(*) AS countDup
        FROM main
        GROUP BY name, code, count
        HAVING COUNT(*) > 1) r
    ON m.name = r.name AND m.code = r.code AND m.count = r.count) t
WHERE countDup IS NOT NULL OR count = ranking
ORDER BY name
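To run this from Scala, assuming the statement above is stored in a string named query (the name is just for illustration):

// Execute the query against the "main" temp view and print all rows untruncated.
val result = spark.sql(query)
result.show(false)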

Result is the same as above!
