Home > Mobile >  Join SQL server tables without a unique identifier and duplicate data
Join SQL server tables without a unique identifier and duplicate data

Time:10-22

I'm relatively inexperienced in SQL and could use some help beyond the usual SELECT and JOIN.

The Problem

Suppose you have 2 tables you wish to join in Microsoft SQL, however they are missing a unique identifier so duplicates entries are incorrectly generated. I've created an example SQLfiddle to try and demonstrate using a small subset of the full database schema http://sqlfiddle.com/#!18/df3fc.

One table has a list of measurement steps taken for 2 systems, identified by their serial. These measurement steps can contain multiple pieces of data, which are contained in the second table. This would not normally be an issue but, as in the sqlfiddle example for serial=1004, sometimes the same data may be retaken as part of a rework. When I then query, each piece of rework data gets joined to each step, duplicating data. The select query:

SELECT my_measurement_steps.id AS steps_id, my_measurement_steps.serial, my_measurement_data.id AS data_id, my_measurement_data.my_data, my_measurement_data.measurementid, my_measurement_steps.date
FROM my_measurement_steps INNER JOIN
      my_measurement_data ON my_measurement_steps.serial = my_measurement_data.serial AND
        my_measurement_steps.measurementid = my_measurement_data.measurementid

Desired Output

steps_id serial data_id my_data measurementid date
15 1004 36 0.9496555 33 2021-10-12 07:55:58.100
14 1004 35 -0.03252285 11 2021-10-07 07:56:31.530
14 1004 34 -0.0003081787 11 2021-10-07 07:56:31.530
13 1004 33 -0.01728721 10 2021-10-07 07:56:31.530
13 1004 32 -0.1996608 10 2021-10-07 07:56:31.530
12 1004 31 0.003044653 9 2021-10-07 07:24:49.500
12 1004 30 0.002392432 9 2021-10-07 07:24:49.500
11 1004 29 1.012242 8 2021-10-07 07:24:30.720
11 1004 28 1.003897 8 2021-10-07 07:24:30.720
11 1004 27 0.9917302 8 2021-10-07 07:24:30.720
11 1004 26 -0.002975781 8 2021-10-07 07:24:30.720
11 1004 25 -0.002746948 8 2021-10-07 07:24:30.720
10 1004 24 0.9695401 33 2021-10-05 11:37:51.430
9 1005 23 0.9731983 33 2021-10-05 08:00:10.490
8 1005 22 0.01013499 11 2021-10-01 07:12:07.470
8 1005 21 -0.007311231 11 2021-10-01 07:12:07.470
7 1005 20 -0.0003634033 10 2021-10-01 07:12:07.470
7 1005 19 -0.2021408 10 2021-10-01 07:12:07.470
6 1005 18 -0.002507007 9 2021-09-30 13:00:57.260
6 1005 17 0.001181299 9 2021-09-30 13:00:57.260
5 1005 16 1.007857 8 2021-09-30 12:39:50.280
5 1005 15 1.000333 8 2021-09-30 12:39:50.280
5 1005 14 0.9913442 8 2021-09-30 12:39:50.280
5 1005 13 0.002449243 8 2021-09-30 12:39:50.280
5 1005 12 -0.002550488 8 2021-09-30 12:39:50.280
4 1004 11 -0.02970417 11 2021-09-30 06:57:33.160
4 1004 10 -0.0007542603 11 2021-09-30 06:57:33.160
3 1004 9 -0.005267761 10 2021-09-30 06:57:33.160
3 1004 8 -0.2038888 10 2021-09-30 06:57:33.160
2 1004 7 -0.007525305 9 2021-09-30 06:56:59.060
2 1004 6 -0.004998779 9 2021-09-30 06:56:59.060
1 1004 5 0.9935537 8 2021-09-29 12:34:08.090
1 1004 4 0.9952038 8 2021-09-29 12:34:08.090
1 1004 3 0.9978707 8 2021-09-29 12:34:08.090
1 1004 2 -0.0006630127 8 2021-09-29 12:34:08.090
1 1004 1 0.0002386719 8 2021-09-29 12:34:08.090

I'm unsure how to achieve the desired output given the repeating data. Also for some serials there can be more than 1 repeat as shown in the example.

Happy to provide any extra information required.

Many Thanks.


Code to Generate Tables

create table my_measurement_steps(id int, serial int, measurementid int, date datetime);
create table my_measurement_data(id int, serial int, my_data float(7), measurementid int);

insert into my_measurement_steps values 
(1,1004,8,'2021-09-29 12:34:08.090'),
(2,1004,9,'2021-09-30 06:56:59.060'),
(3,1004,10,'2021-09-30 06:57:33.160'),
(4,1004,11,'2021-09-30 06:57:33.160'),
(5,1005,8,'2021-09-30 12:39:50.280'),
(6,1005,9,'2021-09-30 13:00:57.260'),
(7,1005,10,'2021-10-01 07:12:07.470'),
(8,1005,11,'2021-10-01 07:12:07.470'),
(9,1004,33,'2021-10-05 08:00:10.490'),
(10,1005,33,'2021-10-05 11:37:51.430'),
(11,1004,8,'2021-10-07 07:24:30.720'),
(12,1004,9,'2021-10-07 07:24:49.500'),
(13,1004,10,'2021-10-07 07:56:31.530'),
(14,1004,11,'2021-10-07 07:56:31.530'),
(15,1004,33,'2021-10-12 07:55:58.100');

insert into my_measurement_data values 
(1,1004,0.0002386719,8),
(2,1004,-0.0006630127,8),
(3,1004,0.9978707,8),
(4,1004,0.9952038,8),
(5,1004,0.9935537,8),
(6,1004,-0.004998779,9),
(7,1004,-0.007525305,9),
(8,1004,-0.2038888,10),
(9,1004,-0.005267761,10),
(10,1004,-0.0007542603,11),
(11,1004,-0.02970417,11),
(12,1005,-0.002550488,8),
(13,1005,0.002449243,8),
(14,1005,0.9913442,8),
(15,1005,1.000333,8),
(16,1005,1.007857,8),
(17,1005,0.001181299,9),
(18,1005,-0.002507007,9),
(19,1005,-0.2021408,10),
(20,1005,-0.0003634033,10),
(21,1005,-0.007311231,11),
(22,1005,0.01013499,11),
(23,1004,0.9695401,33),
(24,1005,0.9731983,33),
(25,1004,-0.002746948,8),
(26,1004,-0.002975781,8),
(27,1004,0.9917302,8),
(28,1004,1.003897,8),
(29,1004,1.012242,8),
(30,1004,0.002392432,9),
(31,1004,0.003044653,9),
(32,1004,-0.1996608,10),
(33,1004,-0.01728721,10),
(34,1004,-0.0003081787,11),
(35,1004,-0.03252285,11),
(36,1004,0.9496555,33);

Edits Added datestamp to measurement step table - sqlfiddle not working so can't update.

All tables now updated and sqlfiddle

Removed section and added desired output

CodePudding user response:

from what I understood you wanted to remove the duplicate rows generated from the select query.
You can use the distinct keyword to do that

SELECT DISTINCT my_measurement_steps.serial, my_measurement_steps.measurementid, 
my_measurement_data.my_data
FROM my_measurement_steps INNER JOIN
      my_measurement_data ON my_measurement_steps.serial = 
my_measurement_data.serial AND
        my_measurement_steps.measurementid = my_measurement_data.measurementid

CodePudding user response:

You want to detect blocks of rows belonging together.

When sorting my_measurement_steps we see that serial/measurementid 1004/8 occurs twice for instance, once in row #1 and then again in row #11.

When sorting my_measurement_data we see about the same thing. The serial/measurementid 1004/8 occurs in two blocks, once in rows #1-5 and then again in rows #25-29.

You want to join the serial/measurementid's nth occurence in my_measurement_steps with its nth occurrence in my_measurement_data.

The detection of such blocks is called a gaps and islands problem. This can be done with two concurrent row counts.

with data_groups_found as
(
  select
    my_measurement_data.*,
    row_number() over (order by id) -
    row_number() over (partition by serial, measurementid order by id) as grp
  from my_measurement_data
)
, data_groups_numbered as
(
  select
    data_groups_found.*,
    dense_rank() over (partition by serial, measurementid order by grp) as grp_id
  from data_groups_found
)
, steps_numbered as
(
  select
    my_measurement_steps.*,
    row_number() over (partition by serial, measurementid order by id) as grp_id
  from my_measurement_steps
)
select *
from steps_numbered s
left join data_groups_numbered d
            on d.serial = s.serial
           and d.measurementid = s.measurementid
           and d.grp_id = s.grp_id
order by s.id, d.id;

Demo: http://sqlfiddle.com/#!18/df3fc/6

  • Related