I'm relatively inexperienced in SQL and could use some help beyond the usual SELECT and JOIN.
The Problem
Suppose you have 2 tables you wish to join in Microsoft SQL, however they are missing a unique identifier so duplicates entries are incorrectly generated. I've created an example SQLfiddle to try and demonstrate using a small subset of the full database schema http://sqlfiddle.com/#!18/df3fc.
One table has a list of measurement steps taken for 2 systems, identified by their serial. These measurement steps can contain multiple pieces of data, which are contained in the second table. This would not normally be an issue but, as in the sqlfiddle example for serial=1004, sometimes the same data may be retaken as part of a rework. When I then query, each piece of rework data gets joined to each step, duplicating data. The select query:
SELECT my_measurement_steps.id AS steps_id, my_measurement_steps.serial, my_measurement_data.id AS data_id, my_measurement_data.my_data, my_measurement_data.measurementid, my_measurement_steps.date
FROM my_measurement_steps INNER JOIN
my_measurement_data ON my_measurement_steps.serial = my_measurement_data.serial AND
my_measurement_steps.measurementid = my_measurement_data.measurementid
Desired Output
steps_id | serial | data_id | my_data | measurementid | date |
---|---|---|---|---|---|
15 | 1004 | 36 | 0.9496555 | 33 | 2021-10-12 07:55:58.100 |
14 | 1004 | 35 | -0.03252285 | 11 | 2021-10-07 07:56:31.530 |
14 | 1004 | 34 | -0.0003081787 | 11 | 2021-10-07 07:56:31.530 |
13 | 1004 | 33 | -0.01728721 | 10 | 2021-10-07 07:56:31.530 |
13 | 1004 | 32 | -0.1996608 | 10 | 2021-10-07 07:56:31.530 |
12 | 1004 | 31 | 0.003044653 | 9 | 2021-10-07 07:24:49.500 |
12 | 1004 | 30 | 0.002392432 | 9 | 2021-10-07 07:24:49.500 |
11 | 1004 | 29 | 1.012242 | 8 | 2021-10-07 07:24:30.720 |
11 | 1004 | 28 | 1.003897 | 8 | 2021-10-07 07:24:30.720 |
11 | 1004 | 27 | 0.9917302 | 8 | 2021-10-07 07:24:30.720 |
11 | 1004 | 26 | -0.002975781 | 8 | 2021-10-07 07:24:30.720 |
11 | 1004 | 25 | -0.002746948 | 8 | 2021-10-07 07:24:30.720 |
10 | 1004 | 24 | 0.9695401 | 33 | 2021-10-05 11:37:51.430 |
9 | 1005 | 23 | 0.9731983 | 33 | 2021-10-05 08:00:10.490 |
8 | 1005 | 22 | 0.01013499 | 11 | 2021-10-01 07:12:07.470 |
8 | 1005 | 21 | -0.007311231 | 11 | 2021-10-01 07:12:07.470 |
7 | 1005 | 20 | -0.0003634033 | 10 | 2021-10-01 07:12:07.470 |
7 | 1005 | 19 | -0.2021408 | 10 | 2021-10-01 07:12:07.470 |
6 | 1005 | 18 | -0.002507007 | 9 | 2021-09-30 13:00:57.260 |
6 | 1005 | 17 | 0.001181299 | 9 | 2021-09-30 13:00:57.260 |
5 | 1005 | 16 | 1.007857 | 8 | 2021-09-30 12:39:50.280 |
5 | 1005 | 15 | 1.000333 | 8 | 2021-09-30 12:39:50.280 |
5 | 1005 | 14 | 0.9913442 | 8 | 2021-09-30 12:39:50.280 |
5 | 1005 | 13 | 0.002449243 | 8 | 2021-09-30 12:39:50.280 |
5 | 1005 | 12 | -0.002550488 | 8 | 2021-09-30 12:39:50.280 |
4 | 1004 | 11 | -0.02970417 | 11 | 2021-09-30 06:57:33.160 |
4 | 1004 | 10 | -0.0007542603 | 11 | 2021-09-30 06:57:33.160 |
3 | 1004 | 9 | -0.005267761 | 10 | 2021-09-30 06:57:33.160 |
3 | 1004 | 8 | -0.2038888 | 10 | 2021-09-30 06:57:33.160 |
2 | 1004 | 7 | -0.007525305 | 9 | 2021-09-30 06:56:59.060 |
2 | 1004 | 6 | -0.004998779 | 9 | 2021-09-30 06:56:59.060 |
1 | 1004 | 5 | 0.9935537 | 8 | 2021-09-29 12:34:08.090 |
1 | 1004 | 4 | 0.9952038 | 8 | 2021-09-29 12:34:08.090 |
1 | 1004 | 3 | 0.9978707 | 8 | 2021-09-29 12:34:08.090 |
1 | 1004 | 2 | -0.0006630127 | 8 | 2021-09-29 12:34:08.090 |
1 | 1004 | 1 | 0.0002386719 | 8 | 2021-09-29 12:34:08.090 |
I'm unsure how to achieve the desired output given the repeating data. Also for some serials there can be more than 1 repeat as shown in the example.
Happy to provide any extra information required.
Many Thanks.
Code to Generate Tables
create table my_measurement_steps(id int, serial int, measurementid int, date datetime);
create table my_measurement_data(id int, serial int, my_data float(7), measurementid int);
insert into my_measurement_steps values
(1,1004,8,'2021-09-29 12:34:08.090'),
(2,1004,9,'2021-09-30 06:56:59.060'),
(3,1004,10,'2021-09-30 06:57:33.160'),
(4,1004,11,'2021-09-30 06:57:33.160'),
(5,1005,8,'2021-09-30 12:39:50.280'),
(6,1005,9,'2021-09-30 13:00:57.260'),
(7,1005,10,'2021-10-01 07:12:07.470'),
(8,1005,11,'2021-10-01 07:12:07.470'),
(9,1004,33,'2021-10-05 08:00:10.490'),
(10,1005,33,'2021-10-05 11:37:51.430'),
(11,1004,8,'2021-10-07 07:24:30.720'),
(12,1004,9,'2021-10-07 07:24:49.500'),
(13,1004,10,'2021-10-07 07:56:31.530'),
(14,1004,11,'2021-10-07 07:56:31.530'),
(15,1004,33,'2021-10-12 07:55:58.100');
insert into my_measurement_data values
(1,1004,0.0002386719,8),
(2,1004,-0.0006630127,8),
(3,1004,0.9978707,8),
(4,1004,0.9952038,8),
(5,1004,0.9935537,8),
(6,1004,-0.004998779,9),
(7,1004,-0.007525305,9),
(8,1004,-0.2038888,10),
(9,1004,-0.005267761,10),
(10,1004,-0.0007542603,11),
(11,1004,-0.02970417,11),
(12,1005,-0.002550488,8),
(13,1005,0.002449243,8),
(14,1005,0.9913442,8),
(15,1005,1.000333,8),
(16,1005,1.007857,8),
(17,1005,0.001181299,9),
(18,1005,-0.002507007,9),
(19,1005,-0.2021408,10),
(20,1005,-0.0003634033,10),
(21,1005,-0.007311231,11),
(22,1005,0.01013499,11),
(23,1004,0.9695401,33),
(24,1005,0.9731983,33),
(25,1004,-0.002746948,8),
(26,1004,-0.002975781,8),
(27,1004,0.9917302,8),
(28,1004,1.003897,8),
(29,1004,1.012242,8),
(30,1004,0.002392432,9),
(31,1004,0.003044653,9),
(32,1004,-0.1996608,10),
(33,1004,-0.01728721,10),
(34,1004,-0.0003081787,11),
(35,1004,-0.03252285,11),
(36,1004,0.9496555,33);
Edits Added datestamp to measurement step table - sqlfiddle not working so can't update.
All tables now updated and sqlfiddle
Removed section and added desired output
CodePudding user response:
from what I understood you wanted to remove the duplicate rows generated from the select query.
You can use the distinct
keyword to do that
SELECT DISTINCT my_measurement_steps.serial, my_measurement_steps.measurementid,
my_measurement_data.my_data
FROM my_measurement_steps INNER JOIN
my_measurement_data ON my_measurement_steps.serial =
my_measurement_data.serial AND
my_measurement_steps.measurementid = my_measurement_data.measurementid
CodePudding user response:
You want to detect blocks of rows belonging together.
When sorting my_measurement_steps
we see that serial/measurementid 1004/8 occurs twice for instance, once in row #1 and then again in row #11.
When sorting my_measurement_data
we see about the same thing. The serial/measurementid 1004/8 occurs in two blocks, once in rows #1-5 and then again in rows #25-29.
You want to join the serial/measurementid's nth occurence in my_measurement_steps
with its nth occurrence in my_measurement_data
.
The detection of such blocks is called a gaps and islands problem. This can be done with two concurrent row counts.
with data_groups_found as
(
select
my_measurement_data.*,
row_number() over (order by id) -
row_number() over (partition by serial, measurementid order by id) as grp
from my_measurement_data
)
, data_groups_numbered as
(
select
data_groups_found.*,
dense_rank() over (partition by serial, measurementid order by grp) as grp_id
from data_groups_found
)
, steps_numbered as
(
select
my_measurement_steps.*,
row_number() over (partition by serial, measurementid order by id) as grp_id
from my_measurement_steps
)
select *
from steps_numbered s
left join data_groups_numbered d
on d.serial = s.serial
and d.measurementid = s.measurementid
and d.grp_id = s.grp_id
order by s.id, d.id;