Can anyone shed some lights to how I should tackle this problem.
Current data
Name | Code | Date | Count |
---|---|---|---|
A | 1A | 2020-05-03 | 34 |
A | 1A | 2020-04-02 | 25 |
B | 3D | 2021-04-23 | 24 |
C | 2X | 2021-04-01 | 01 |
C | 2X | 2021-03-31 | 01 |
Desired Output:
Name | Code | Date | Count |
---|---|---|---|
A | 1A | 2020-05-03 | 34 |
B | 3D | 2021-04-23 | 24 |
C | 2X | 2021-04-01 | 01 |
C | 2X | 2021-03-31 | 01 |
Output from my code:
Name | Code | Date | Count |
---|---|---|---|
A | 1A | 2020-05-03 | 34 |
B | 3D | 2021-04-23 | 24 |
C | 2X | 2021-04-01 | 01 |
Below is my code:
SELECT
name,
code,
MAX(date) AS dates,
MAX(Cases_Number) AS Max_Num
FROM(
SELECT
lhd_2010_name AS name,
lhd_2010_code AS code,
notification_date AS date,
FLOOR(SUM(num)) as Cases_Number
FROM cases
GROUP BY
notification_date,
lhd_2010_name,
lhd_2010_code
ORDER BY Cases_Number DESC, notification_date, lhd_2010_name DESC
) AS innertable
GROUP BY name,code ORDER BY Max_Num DESC")
In the innertable I had to sum up the counts as all the counts were 1 before with GroupBy Name Code and Date to get the total counts. Then on the outertable I have to find the max count based on Name Code combination. If max count is the same name code combination, we will output the row too.
I understand the reason for the missing row is because I have used max(date), but this is the only way for me to be able to group by name and code, and also showing the dates. If I try to group by name, code, and dates it will show all other rows.
Thanks in Advance
CodePudding user response:
Let's call your main table main
, we can first group by name
, code
and count
to find the count (of duplicates), we name the alias countDup
and we filter countDup
> 1, basically, we need these kind of rows:
|C |2X |1 |2020-04-01|
The code looks like this:
val ds2 = main.groupBy("name", "code", "count")
.agg(count("*").alias("countDup"))
.where(col("countDup")
.gt(1))
Preview of the code:
---- ---- ----- --------
|name|code|count|countDup|
---- ---- ----- --------
| C| 2X| 1| 2|
---- ---- ----- --------
Then, we join with main table (left join), we add a rank to get maximum count, then we use a filter to filter only rows that we want, code:
main
.join(ds2, Seq("name", "code", "count"), "left")
.withColumn("ranking", expr("max(count) over (partition by name,code)"))
.filter(col("countDup").isNotNull || col("count").equalTo(col("ranking")))
.drop("countDup", "ranking")
.orderBy("name")
Final output (with order in name
):
---- ---- ----- ----------
|name|code|count|date |
---- ---- ----- ----------
|A |1A |34 |2020-05-03|
|B |3D |24 |2020-04-23|
|C |2X |1 |2020-04-01|
|C |2X |1 |2020-03-31|
---- ---- ----- ----------
I hope this is what you need!
SPARK SQL VERSION
First, we create the temp table:
main.createTempView("main")
Then apply the following SQL:
SELECT name,code,date,count FROM (
SELECT m.name,m.code,m.date,m.count,r.countDup,MAX(m.count) OVER (PARTITION BY m.name,m.code) AS ranking FROM main m LEFT JOIN (
SELECT name,code,count,COUNT(*) AS countDup FROM main GROUP BY name,code,count HAVING COUNT(*) > 1) r
ON m.name = r.name AND m.code = r.code AND m.count = r.count)
WHERE countDup > 0 OR count == ranking ORDER BY name
Result is the same as above!