Join SQL server tables without a unique identifier and duplicate data-CodePudding

I'm relatively inexperienced in SQL and could use some help beyond the usual SELECT and JOIN.

The Problem

Suppose you have 2 tables you wish to join in Microsoft SQL, however they are missing a unique identifier so duplicates entries are incorrectly generated. I've created an example SQLfiddle to try and demonstrate using a small subset of the full database schema http://sqlfiddle.com/#!18/df3fc.

One table has a list of measurement steps taken for 2 systems, identified by their serial. These measurement steps can contain multiple pieces of data, which are contained in the second table. This would not normally be an issue but, as in the sqlfiddle example for serial=1004, sometimes the same data may be retaken as part of a rework. When I then query, each piece of rework data gets joined to each step, duplicating data. The select query:

SELECT my_measurement_steps.id AS steps_id, my_measurement_steps.serial, my_measurement_data.id AS data_id, my_measurement_data.my_data, my_measurement_data.measurementid, my_measurement_steps.date
FROM my_measurement_steps INNER JOIN
      my_measurement_data ON my_measurement_steps.serial = my_measurement_data.serial AND
        my_measurement_steps.measurementid = my_measurement_data.measurementid

Desired Output

steps_id	serial	data_id	my_data	measurementid	date
15	1004	36	0.9496555	33	2021-10-12 07:55:58.100
14	1004	35	-0.03252285	11	2021-10-07 07:56:31.530
14	1004	34	-0.0003081787	11	2021-10-07 07:56:31.530
13	1004	33	-0.01728721	10	2021-10-07 07:56:31.530
13	1004	32	-0.1996608	10	2021-10-07 07:56:31.530
12	1004	31	0.003044653	9	2021-10-07 07:24:49.500
12	1004	30	0.002392432	9	2021-10-07 07:24:49.500
11	1004	29	1.012242	8	2021-10-07 07:24:30.720
11	1004	28	1.003897	8	2021-10-07 07:24:30.720
11	1004	27	0.9917302	8	2021-10-07 07:24:30.720
11	1004	26	-0.002975781	8	2021-10-07 07:24:30.720
11	1004	25	-0.002746948	8	2021-10-07 07:24:30.720
10	1004	24	0.9695401	33	2021-10-05 11:37:51.430
9	1005	23	0.9731983	33	2021-10-05 08:00:10.490
8	1005	22	0.01013499	11	2021-10-01 07:12:07.470
8	1005	21	-0.007311231	11	2021-10-01 07:12:07.470
7	1005	20	-0.0003634033	10	2021-10-01 07:12:07.470
7	1005	19	-0.2021408	10	2021-10-01 07:12:07.470
6	1005	18	-0.002507007	9	2021-09-30 13:00:57.260
6	1005	17	0.001181299	9	2021-09-30 13:00:57.260
5	1005	16	1.007857	8	2021-09-30 12:39:50.280
5	1005	15	1.000333	8	2021-09-30 12:39:50.280
5	1005	14	0.9913442	8	2021-09-30 12:39:50.280
5	1005	13	0.002449243	8	2021-09-30 12:39:50.280
5	1005	12	-0.002550488	8	2021-09-30 12:39:50.280
4	1004	11	-0.02970417	11	2021-09-30 06:57:33.160
4	1004	10	-0.0007542603	11	2021-09-30 06:57:33.160
3	1004	9	-0.005267761	10	2021-09-30 06:57:33.160
3	1004	8	-0.2038888	10	2021-09-30 06:57:33.160
2	1004	7	-0.007525305	9	2021-09-30 06:56:59.060
2	1004	6	-0.004998779	9	2021-09-30 06:56:59.060
1	1004	5	0.9935537	8	2021-09-29 12:34:08.090
1	1004	4	0.9952038	8	2021-09-29 12:34:08.090
1	1004	3	0.9978707	8	2021-09-29 12:34:08.090
1	1004	2	-0.0006630127	8	2021-09-29 12:34:08.090
1	1004	1	0.0002386719	8	2021-09-29 12:34:08.090

I'm unsure how to achieve the desired output given the repeating data. Also for some serials there can be more than 1 repeat as shown in the example.

Happy to provide any extra information required.

Many Thanks.

Code to Generate Tables

create table my_measurement_steps(id int, serial int, measurementid int, date datetime);
create table my_measurement_data(id int, serial int, my_data float(7), measurementid int);

insert into my_measurement_steps values 
(1,1004,8,'2021-09-29 12:34:08.090'),
(2,1004,9,'2021-09-30 06:56:59.060'),
(3,1004,10,'2021-09-30 06:57:33.160'),
(4,1004,11,'2021-09-30 06:57:33.160'),
(5,1005,8,'2021-09-30 12:39:50.280'),
(6,1005,9,'2021-09-30 13:00:57.260'),
(7,1005,10,'2021-10-01 07:12:07.470'),
(8,1005,11,'2021-10-01 07:12:07.470'),
(9,1004,33,'2021-10-05 08:00:10.490'),
(10,1005,33,'2021-10-05 11:37:51.430'),
(11,1004,8,'2021-10-07 07:24:30.720'),
(12,1004,9,'2021-10-07 07:24:49.500'),
(13,1004,10,'2021-10-07 07:56:31.530'),
(14,1004,11,'2021-10-07 07:56:31.530'),
(15,1004,33,'2021-10-12 07:55:58.100');

insert into my_measurement_data values 
(1,1004,0.0002386719,8),
(2,1004,-0.0006630127,8),
(3,1004,0.9978707,8),
(4,1004,0.9952038,8),
(5,1004,0.9935537,8),
(6,1004,-0.004998779,9),
(7,1004,-0.007525305,9),
(8,1004,-0.2038888,10),
(9,1004,-0.005267761,10),
(10,1004,-0.0007542603,11),
(11,1004,-0.02970417,11),
(12,1005,-0.002550488,8),
(13,1005,0.002449243,8),
(14,1005,0.9913442,8),
(15,1005,1.000333,8),
(16,1005,1.007857,8),
(17,1005,0.001181299,9),
(18,1005,-0.002507007,9),
(19,1005,-0.2021408,10),
(20,1005,-0.0003634033,10),
(21,1005,-0.007311231,11),
(22,1005,0.01013499,11),
(23,1004,0.9695401,33),
(24,1005,0.9731983,33),
(25,1004,-0.002746948,8),
(26,1004,-0.002975781,8),
(27,1004,0.9917302,8),
(28,1004,1.003897,8),
(29,1004,1.012242,8),
(30,1004,0.002392432,9),
(31,1004,0.003044653,9),
(32,1004,-0.1996608,10),
(33,1004,-0.01728721,10),
(34,1004,-0.0003081787,11),
(35,1004,-0.03252285,11),
(36,1004,0.9496555,33);

Edits Added datestamp to measurement step table - sqlfiddle not working so can't update.

All tables now updated and sqlfiddle

Removed section and added desired output

CodePudding user response：

from what I understood you wanted to remove the duplicate rows generated from the select query.
You can use the distinct keyword to do that

SELECT DISTINCT my_measurement_steps.serial, my_measurement_steps.measurementid, 
my_measurement_data.my_data
FROM my_measurement_steps INNER JOIN
      my_measurement_data ON my_measurement_steps.serial = 
my_measurement_data.serial AND
        my_measurement_steps.measurementid = my_measurement_data.measurementid

CodePudding user response：

You want to detect blocks of rows belonging together.

When sorting my_measurement_steps we see that serial/measurementid 1004/8 occurs twice for instance, once in row #1 and then again in row #11.

When sorting my_measurement_data we see about the same thing. The serial/measurementid 1004/8 occurs in two blocks, once in rows #1-5 and then again in rows #25-29.

You want to join the serial/measurementid's nth occurence in my_measurement_steps with its nth occurrence in my_measurement_data.

The detection of such blocks is called a gaps and islands problem. This can be done with two concurrent row counts.

with data_groups_found as
(
  select
    my_measurement_data.*,
    row_number() over (order by id) -
    row_number() over (partition by serial, measurementid order by id) as grp
  from my_measurement_data
)
, data_groups_numbered as
(
  select
    data_groups_found.*,
    dense_rank() over (partition by serial, measurementid order by grp) as grp_id
  from data_groups_found
)
, steps_numbered as
(
  select
    my_measurement_steps.*,
    row_number() over (partition by serial, measurementid order by id) as grp_id
  from my_measurement_steps
)
select *
from steps_numbered s
left join data_groups_numbered d
            on d.serial = s.serial
           and d.measurementid = s.measurementid
           and d.grp_id = s.grp_id
order by s.id, d.id;

Demo: http://sqlfiddle.com/#!18/df3fc/6