Systematically track rows being grouped by in Postgresql?-CodePudding

I have a huge table with > 1M rows, and it looks like:

Row	Date	Condition	Value
1	20210101	Shipment	100
2	20210101	Call	200
3	20210102	Examination	500
4	20210102	Examination	5
5	20210103	Shipment	100

In most cases, there is only one single row, when I consider both Date and Condition (such as row 1, 2, 5). However, I do find few situations, like row 3 and 4, where Date and Condition are exactly the same; yet there are more than one row showed up.

I would like to find these rows so that further analysis can be done.

I could use group by and get averages/mediums for Value, considering Date and Condition.

select Date, Condition, Max(value) from test_table
group by Date, Condition

However how do I know which rows (such as 3 and 4) actually got averaged? Maybe I could ask this question in a better way but am relatively new to sql and would appreciate your advice here. Thank you.

CodePudding user response：

You can find the duplicate rows by using a window function to compute the "group size". For example:

select *
from (
  select *,
    count(*) over(partition by date, condit) as group_size
  from t
) x
where group_size > 1

Result:

 id  date        condit       value  group_size 
 --- ----------- ------------ ------ ---------- 
 3   2021-01-02  Examination  500    2          
 4   2021-01-02  Examination  5      2

See running example at DB Fiddle.