Home > Software engineering >  Systematically track rows being grouped by in Postgresql?
Systematically track rows being grouped by in Postgresql?

Time:10-04

I have a huge table with > 1M rows, and it looks like:

Row Date Condition Value
1 20210101 Shipment 100
2 20210101 Call 200
3 20210102 Examination 500
4 20210102 Examination 5
5 20210103 Shipment 100

In most cases, there is only one single row, when I consider both Date and Condition (such as row 1, 2, 5). However, I do find few situations, like row 3 and 4, where Date and Condition are exactly the same; yet there are more than one row showed up.

I would like to find these rows so that further analysis can be done.

I could use group by and get averages/mediums for Value, considering Date and Condition.

select Date, Condition, Max(value) from test_table
group by Date, Condition

However how do I know which rows (such as 3 and 4) actually got averaged? Maybe I could ask this question in a better way but am relatively new to sql and would appreciate your advice here. Thank you.

CodePudding user response:

You can find the duplicate rows by using a window function to compute the "group size". For example:

select *
from (
  select *,
    count(*) over(partition by date, condit) as group_size
  from t
) x
where group_size > 1

Result:

 id  date        condit       value  group_size 
 --- ----------- ------------ ------ ---------- 
 3   2021-01-02  Examination  500    2          
 4   2021-01-02  Examination  5      2          

See running example at DB Fiddle.

  • Related