I need to spot an invalid combination of values within a large data set. The demo below is a very small data set; however, it illustrates the priniple. (I use Redshift)
I receive a data set like with 3 columns: client_id, t_number, t_name.
Client_id, t_number, t_name are repeated many times within the data set (when a clients records a new transaction each of these is recorded as a new line in a file). t_number (transaction number - (an id of transaction)) and t_name (transaction name) should be an unique combination. A specific t_number should only match to a single t_name (these are duplicated data essentially as number and name should reflect the same activity).
The data set has no null values. There are more t_numbers than t_names. This suggests that there must be a t_name (set of t_names) with more than one t_number assigned.
The simplified data set is below:
create table tmp.test_v1 (
id varchar(5)
, t_number smallint
, t_name varchar(5));
Insert into tmp.test_v1 values
('id_1', 1, 'aaa')
,('id_2', 1, 'aaa')
,('id_3', 2, 'bbb')
,('id_4', 3, 'ccc')
,('id_5', 2, 'bbb')
,('id_6', 4, 'ddd')
,('id_3', 5, 'aaa')
,('id_1', 2, 'bbb')
,('id_2', 3, 'ccc')
,('id_4', 4, 'ddd')
,('id_7', 6, 'eee')
,('id_8', 6, 'eee')
,('id_9', 6, 'eee');
The t_name 'aaa' is associated with t_number: 1 and 5. This is a situation that i want to catch.
I came up with a following query:
select t_name, t_number from tmp.test_v1
where t_name = (select t_name from (
select a.t_name, b.t_number
from tmp.test_v1 as a
left join tmp.test_v1 as b on a.t_name = b.t_name
group by a.t_name, b.t_number
order by a.t_name) c
group by t_name
having count(t_name) > 1)
group by t_name, t_number;
The above query returns:
'---------------------'
| t_name | t_number |
'---------------------'
| aaa | 1 |
'---------------------'
| aaa | 5 |
'---------------------'
I can see that t_name 'aaa' has numbers 1,5 assigned to it.
Problem: this does not scale well. On a data set with 130 million records it seems to run endlessly.
What would be a more efficient way to arrive at the same/ similar result?
CodePudding user response:
I think the join is taking all the time. You can avoid the join and maybe make it faster with:
select distinct t_name, t_number from test_v1 where t_name in (
select t_name
from ( select distinct t_name, t_number from test_v1 ) t1
group by t_name
having count(t_number)>1
)
Try it here: https://www.db-fiddle.com/f/vhwn1WKB7uV2yUZ4FhNPH/0
CodePudding user response:
The joining is cost you all the time. This is basically a looping condition and with the number of repeated values you have this will explode the data. Because you want to have both the name and numbers that are violations this makes things a little tricky. Here's how I'd go after this:
select t_name, t_number
from (
select t_name, t_number,
count(t_number) over(partition by t_name) as cnt
from test_v1
group by 1, 2 ) sub
where cnt > 1
This only scans the table once, doesn't loop join, and drives down the data size quickly through the "group by".