I have a huge table with > 1M rows, and it looks like:
Row | Date | Condition | Value |
---|---|---|---|
1 | 20210101 | Shipment | 100 |
2 | 20210101 | Call | 200 |
3 | 20210102 | Examination | 500 |
4 | 20210102 | Examination | 5 |
5 | 20210103 | Shipment | 100 |
In most cases, there is only one single row, when I consider both Date and Condition (such as row 1, 2, 5). However, I do find few situations, like row 3 and 4, where Date and Condition are exactly the same; yet there are more than one row showed up.
I would like to find these rows so that further analysis can be done.
I could use group by and get averages/mediums for Value, considering Date and Condition.
select Date, Condition, Max(value) from test_table
group by Date, Condition
However how do I know which rows (such as 3 and 4) actually got averaged? Maybe I could ask this question in a better way but am relatively new to sql and would appreciate your advice here. Thank you.
CodePudding user response:
You can find the duplicate rows by using a window function to compute the "group size". For example:
select *
from (
select *,
count(*) over(partition by date, condit) as group_size
from t
) x
where group_size > 1
Result:
id date condit value group_size
--- ----------- ------------ ------ ----------
3 2021-01-02 Examination 500 2
4 2021-01-02 Examination 5 2
See running example at DB Fiddle.