Home > front end >  Compare multiple records of normalized data
Compare multiple records of normalized data

Time:04-26

I have data of the following format for which I am trying to preform comparison using SQL:

The data are ratings of "groups" with up to 4 values rated per group:

tblRatings

RatingID GroupID Value1 Value2 Value3 Value4
1 2222 13 19 (null) (null)
2 2222 13 (null) (null) (null)
3 2223 1 (null) (null) (null)
4 2223 1 (null) (null) (null)
5 2224 5 (null) (null) (null)
6 2225 10 12 13 (null)
7 2225 12 13 10 (null)

My goal is to compare the records and determine which GroupIDs have two ratings that match, with a match defined as the same list of values in any order, with all nulls to be ignored. Thus in the example data, GroupIDs 2223 and 2225 have a match, and the others do not.

How would you proceed to perform this comparison?

As a first step, I have used a union query to normalize the data to one value per row, as follows:

qryRatingsNormalized

RatingID GroupID Value
1 2222 13
1 2222 19
2 2222 13
3 2223 1
4 2223 1
5 2224 5
6 2225 10
6 2225 12
6 2225 13
7 2225 12
7 2225 13
7 2225 10

However I'm unsure how to proceed from there.

FYI, I am working in MS Access using tables linked in SQL Server.

CodePudding user response:

If you are using SQL-server you can try to use CROSS APPLY

SELECT t1.RatingID,
        t1.GroupID,
        v.value
FROM T t1 CROSS APPLY (
    VALUES 
        (Value1),
        (Value2),
        (Value3),
        (Value4)
) v (value)
WHERE v.value IS NOT NULL

Or use UNION ALL

SELECT t1.*
FROM ( 
    SELECT RatingID,GroupID,Value1 value
    FROM T 
    UNION ALL 
    SELECT RatingID,GroupID,Value2
    FROM T 
    UNION ALL 
    SELECT RatingID,GroupID,Value3
    FROM T 
    UNION ALL 
    SELECT RatingID,GroupID,Value4
    FROM T 
) t1
WHERE value IS NOT NULL

sqlfiddle

CodePudding user response:

You could try to set up a pass-through query with tsql to get around the Access limitations:

with r as (
    select GroupID, RatingID, val,
        row_number() over (partition by RatingID order by val) as rn,
        count(*) over (partition by RatingID) as cnt
    from tblRatings cross apply
        (values (Value1), (Value2), (Value3), (Value4)) as v (val)
    where val is not null
)
select r1.RatingID, r2.RatingID, min(r1.GroupID) as GroupID
from r r1 left outer join r r2 on
    r2.GroupID = r1.GroupID and r2.RatingID > r1.RatingID
    and r2.cnt = r1.cnt
    and r2.rn = r1.rn and r2.val = r1.val
group by r1.RatingID, r2.RatingID
having count(r2.val) = count(*) and count(*) = min(r2.cnt);

See working example here:

https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=ab79c9bcafbfcad2efbfdeb5dc102420

CodePudding user response:

I haven't really kept up with the state of Access but I'm thinking something like this should be valid. You may be able to get away with matching on a set of aggregate values that stand in for the individual values.

select r1.RatingID, r2.RatingID, r1.GroupID
from
    (
    select
        GroupID, RatingID,
        count(Value) cnt, sum(Value) tot,
        avg(Value) avg, min(Value) as lst, max(Value) as grt,
        floor(sum(log(Value)) * 100000) as lg/* assumes no zeroes. a positive offset would likely fix that problem too */
    from qryRatingsNormalized
    group by GroupID, RatingID
    ) r1
        inner join
    (
    select
        GroupID, RatingID,
        count(Value) cnt, sum(Value) tot,
        avg(Value) avg, min(Value) as lst, max(Value) as grt,
        floor(sum(log(Value)) * 100000) as lg
    from qryRatingsNormalized
    group by GroupID, RatingID
    ) r2 on     r2.GroupID = r1.GroupID and r2.RatingID > r1.RatingID
            and r2.cnt = r1.cnt and r2.lst = r1.lst and r2.grt = r1.grt
            and r2.tot = r1.tot and r2.avg = r1.avg and r2.lg = r1.lg

I've repeated the same subquery twice so perhaps you'd need to just define that as a named Query/View to get this to work without the named subquery.

It's probably advantageous that your maximum is four values. Some of this probably hinges on whether you need to do this repeatedly, what the values are, how many groups you've got, whether zero is valid, whether negative numbers are valid...

https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=ab79c9bcafbfcad2efbfdeb5dc102420

  • Related