Find an illegal combination of values (composite key) in a table-CodePudding

I need to spot an invalid combination of values within a large data set. The demo below is a very small data set; however, it illustrates the priniple. (I use Redshift)

I receive a data set like with 3 columns: client_id, t_number, t_name.

Client_id, t_number, t_name are repeated many times within the data set (when a clients records a new transaction each of these is recorded as a new line in a file). t_number (transaction number - (an id of transaction)) and t_name (transaction name) should be an unique combination. A specific t_number should only match to a single t_name (these are duplicated data essentially as number and name should reflect the same activity).

The data set has no null values. There are more t_numbers than t_names. This suggests that there must be a t_name (set of t_names) with more than one t_number assigned.

The simplified data set is below:

create table tmp.test_v1 (
id varchar(5)  
, t_number smallint  
, t_name varchar(5));

Insert into tmp.test_v1 values  
    ('id_1', 1, 'aaa')  
    ,('id_2', 1, 'aaa')  
    ,('id_3', 2, 'bbb')  
    ,('id_4', 3, 'ccc')  
    ,('id_5', 2, 'bbb')  
    ,('id_6', 4, 'ddd')  
    ,('id_3', 5, 'aaa')  
    ,('id_1', 2, 'bbb')  
    ,('id_2', 3, 'ccc')  
    ,('id_4', 4, 'ddd')  
    ,('id_7', 6, 'eee')  
    ,('id_8', 6, 'eee')  
    ,('id_9', 6, 'eee');

The t_name 'aaa' is associated with t_number: 1 and 5. This is a situation that i want to catch.

I came up with a following query:

 select t_name, t_number from tmp.test_v1  
    where t_name = (select t_name from (  
    select a.t_name, b.t_number  
    from tmp.test_v1 as a   
    left join tmp.test_v1 as b on a.t_name = b.t_name  
    group by a.t_name, b.t_number  
    order by a.t_name) c  
    group by t_name  
    having count(t_name) > 1)  
    group by t_name, t_number;

The above query returns:

'---------------------'  
| t_name  | t_number  |  
'---------------------'    
| aaa     |    1      |
'---------------------'   
| aaa     |    5      |        
'---------------------'

I can see that t_name 'aaa' has numbers 1,5 assigned to it.

Problem: this does not scale well. On a data set with 130 million records it seems to run endlessly.

What would be a more efficient way to arrive at the same/ similar result?

CodePudding user response：

I think the join is taking all the time. You can avoid the join and maybe make it faster with:

select distinct t_name, t_number from test_v1 where t_name in (
  select t_name
  from ( select distinct t_name, t_number from test_v1 ) t1
  group by t_name
  having count(t_number)>1
)

Try it here: https://www.db-fiddle.com/f/vhwn1WKB7uV2yUZ4FhNPH/0

CodePudding user response：

The joining is cost you all the time. This is basically a looping condition and with the number of repeated values you have this will explode the data. Because you want to have both the name and numbers that are violations this makes things a little tricky. Here's how I'd go after this:

select t_name, t_number
from (
  select t_name, t_number, 
      count(t_number) over(partition by t_name) as cnt
  from test_v1
  group by 1, 2 ) sub
where cnt > 1

This only scans the table once, doesn't loop join, and drives down the data size quickly through the "group by".