How to I get distinct combinations of one XRef column related to any value in the other XRef column-CodePudding

I need to select the count of unique value combinations of column B in an XRef table which is grouped by column A.

Consider the following schema and data, which represents a simple family structure. Each child has a father and mother:

TABLE Father

FatherID	Name
1	Alex
2	Bob

TABLE Mother

MotherID	Name
1	Alice
2	Barbara

TABLE Child

ChildID	FatherID	MotherID	Name
1	1 (Alex)	1 (Alice)	Adam
2	1 (Alex)	1 (Alice)	Billy
3	1 (Alex)	2 (Barbara)	Celine
4	2 (Bob)	2 (Barbara)	Derek

The distinct combinations of mothers for each father are:

Alex (Alice, Barbara)
Bob (Barbara)

In all there are two distinct combinations of mothers:

Alice, Barbara
Barbara

The query I want to write would return the count of those distinct combinations of mother, regardless of which father they are associated with:

UniqueMotherGroups
2

I was able to do this successfully using the STRING_AGG function, but it feels clunky. It also needs to operate over millions of rows and is quite slow at the moment. Is there a more idiomatic way to do this with set operations instead?

Here is my working example:

-- Drop pre-existing tables

DROP TABLE IF EXISTS dbo.Child;

DROP TABLE IF EXISTS dbo.Father;

DROP TABLE IF EXISTS dbo.Mother;

-- Create family tables.

CREATE TABLE dbo.Father
(
    FatherID INT NOT NULL
  , Name VARCHAR(50) NOT NULL
);

ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
    PRIMARY KEY CLUSTERED (FatherID);

ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);

CREATE TABLE dbo.Mother
(
    MotherID INT NOT NULL
  , Name VARCHAR(50) NOT NULL
);

ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
    PRIMARY KEY CLUSTERED (MotherID);

ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);

CREATE TABLE dbo.Child
(
    ChildID INT NOT NULL
  , FatherID INT NOT NULL
  , MotherID INT NOT NULL
  , Name VARCHAR(50) NOT NULL
);

ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
    PRIMARY KEY CLUSTERED (ChildID);

CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);

ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
    FOREIGN KEY (FatherID)
    REFERENCES dbo.Father (FatherID);

ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
    FOREIGN KEY (MotherID)
    REFERENCES dbo.Mother (MotherID);

-- Insert two children with the same parents

INSERT INTO dbo.Father
(
    FatherID
  , Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')

INSERT INTO dbo.Mother
(
    MotherID
  , Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');

INSERT INTO dbo.Child
(
    ChildID
  , FatherID
  , MotherID
  , Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');

-- CTE Gets distinct combinations of parents
WITH distinctParentCombinations (FatherID, MotherID)
AS (SELECT children.FatherID
         , children.MotherID
    FROM dbo.Child as children
    GROUP BY children.FatherID
           , children.MotherID
   )
   -- CTE Gets uses STRING_AGG to get unique combinations of mothers.
   , motherGroups (Mothers)
AS (SELECT STRING_AGG(CONVERT(VARCHAR(MAX), distinctParentCombinations.MotherID), '-') WITHIN GROUP (ORDER BY distinctParentCombinations.MotherID) AS Mothers
    FROM distinctParentCombinations
    GROUP BY distinctParentCombinations.FatherID
   )

-- Remove the COUNT function to see the actual combinations
SELECT COUNT(motherGroups.Mothers) AS UniqueMotherGroups
FROM motherGroups


-- Clean up the example

DROP TABLE IF EXISTS dbo.Child;

DROP TABLE IF EXISTS dbo.Father;

DROP TABLE IF EXISTS dbo.Mother;

CodePudding user response：

Thank you for posting such a comprehensive setup for the test data. However, I'm not running any CREATE/DROP statements against my DB so I converted those tables into table variables. Using your data, I came up with the following query. Just change the table names back to your dbo. names and you should be able to test in your environment. I basically concatenate every father/mother combo into a text string using FOR XML PATH. Then I count up all the distinct combos. If you find error in my logic, let me know. I'm just tossing this in the ring of possible solutions.

WITH distinctCombos AS (
    SELECT DISTINCT
        c.FatherID, c.MotherID
    FROM @Child as c
) , motherComboCount AS (
    SELECT
        f.FatherID
        , f.[Name]
        , STUFF((
            SELECT
                ','   CAST(dc.MotherID as nvarchar)
            FROM distinctCombos as dc
            WHERE dc.FatherID = f.FatherID
            ORDER BY dc.MotherID ASC
            FOR XML PATH('')
        ),1,1,'') as motherList
    FROM @Father as f
)
SELECT
    COUNT(DISTINCT motherList) as UniqueMotherGroups
FROM motherComboCount as mcc

To save a bit of compute power, remove the STUFF function as it's not necessary for the comparison... it just makes the list nicer to look at if displaying... and I'm in the habit of using it.

It looks like the main differences between our methods is the use of FOR XML PATH vs STRING_AGG (I'm still on older SQL.) And I use DISTINCT twice instead of GROUP BY. If you have a larger dataset to test against, let me know how the 2 methods compare. I'm trying to think of a completely set-based method but I can't see it at the moment.

Update: Method 2. Here's an idea I had using recursive CTEs to build the distinct mother combinations. In your example data, there are only 2 mothers per father. So there would be a total of 4 set-based queries performed (first CTE, 2 queries in the recursive CTE and the final SELECT).

WITH uniqueCombo as (
    SELECT DISTINCT
        c.FatherID
        , c.MotherID
        , ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.MotherID) as row_num
    FROM @Child as c
), combos as (
    SELECT
        uc.FatherID
        , uc.MotherID
        , CAST(uc.MotherID as nvarchar(max)) as [path]
        , row_num
        , 0 as hierarchy_num
    FROM uniqueCombo as uc
    WHERE uc.row_num = 1

    UNION ALL

    SELECT
        uc.FatherID
        , uc.MotherID
        , co.[path]   ','   CAST(uc.MotherID as nvarchar(max))
        , uc.row_num
        , co.hierarchy_num   1 as heirarchy_num
    FROM uniqueCombo as uc
        INNER JOIN combos as co
            ON co.FatherID = uc.FatherID
            --AND co.MotherID <> uc.MotherID
            AND co.row_num   1 = uc.row_num
), rankedCombos as (
    SELECT 
        c.[path]
        , ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.hierarchy_num DESC) as row_num
    FROM combos as c
)
SELECT COUNT(DISTINCT rc.[path]) as UniqueMotherGroups
FROM rankedCombos as rc
WHERE rc.row_num = 1

Update 2: I had another idea to use a PIVOT to transpose the records so that the FatherID would be in the left-most column with the MotherIDs as the column headers. To make that work with a dynamic list of MotherIDs, you have to use a dynamic PIVOT/dynamic SQL. (FatherID isn't really needed in the PIVOT so it's not included in the PIVOT query. I just had to describe what the goal is.) After the pivot, you can SELECT DISTINCT to get the unique mother combinations. Then the last SELECT is to get the COUNT. This one I ran in SQL Fiddle:

SQL Fiddle

MS SQL Server 2017 Schema Setup:

-- Create family tables.

CREATE TABLE dbo.Father
(
    FatherID INT NOT NULL
  , Name VARCHAR(50) NOT NULL
);

ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
    PRIMARY KEY CLUSTERED (FatherID);

ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);

CREATE TABLE dbo.Mother
(
    MotherID INT NOT NULL
  , Name VARCHAR(50) NOT NULL
);

ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
    PRIMARY KEY CLUSTERED (MotherID);

ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);

CREATE TABLE dbo.Child
(
    ChildID INT NOT NULL
  , FatherID INT NOT NULL
  , MotherID INT NOT NULL
  , Name VARCHAR(50) NOT NULL
);

ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
    PRIMARY KEY CLUSTERED (ChildID);

CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);

ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
    FOREIGN KEY (FatherID)
    REFERENCES dbo.Father (FatherID);

ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
    FOREIGN KEY (MotherID)
    REFERENCES dbo.Mother (MotherID);

-- Insert two children with the same parents

INSERT INTO dbo.Father
(
    FatherID
  , Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')

INSERT INTO dbo.Mother
(
    MotherID
  , Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');

INSERT INTO dbo.Child
(
    ChildID
  , FatherID
  , MotherID
  , Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');

Query 1:

DECLARE @cols AS nvarchar(MAX)
DECLARE @query AS nvarchar(MAX)

SET @cols = STUFF((
    SELECT DISTINCT ','   QUOTENAME(m.MotherID) 
    FROM Mother as m
    FOR XML PATH('')) 
,1,1,'')
    
SET @query = 'SELECT COUNT(mCount) as UniqueMotherGroups FROM (
    SELECT DISTINCT '   @cols   ', 1 as mCount FROM (
        SELECT  '   @cols   ' 
        FROM (
            SELECT
                c.FatherID
                , c.MotherID
                , 1 as mID
            FROM child as c
        ) x
        PIVOT 
        (
            MAX(mID)
            FOR MotherID in ('   @cols   ')
        ) p
    ) as m
) as mg'

--SELECT @query
Exec(@query)

Results:

| UniqueMotherGroups |
|--------------------|
|                  3 |

UPDATE 3: Here's one other idea... create a results table with a unique constraint and with IGNORE_DUP_KEY=ON. You could use this in a function or stored procedure, or, setup a trigger to put the mother combinations into a unique-combo-holding-table. With IGNORE_DUP_KEY=ON, you can insert every combo and only the unique combos will remain. Then just do a count of all the rows.

--Create a table to hold the results:
CREATE TABLE results (
  ChildID int not null
  , UniqueCombos nvarchar(50) not null
  PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)
);

--Insert all combos into the results table. The unique constraint will cause only unique entries to remain.
INSERT INTO results (ChildID, UniqueCombos)
SELECT DISTINCT
  c.ChildID
  , (
    SELECT ','   CAST(MotherID as nvarchar(500))
    FROM Child as c2
    WHERE c2.ChildID = c.ChildID
    ORDER BY c2.MotherID
    FOR XML PATH('')
  ) as mother_combos
FROM Child as c
;

--Count up all the rows in the results table. Since these are all unique combinations, it should be fast to sum.
SELECT COUNT(*)
FROM results;

CodePudding user response：

If you accept to define a maximum number of mothers per father (here 7) you may try:

select count(*) as UniqueMotherGroups from (
select distinct m1, m2, m3, m4, m5, m6, m7 from (
    select FatherID, row_number() over(partition by FatherID order by motherid) as rn, motherid
    from (
        select distinct FatherID, MotherID
        from t_Child 
    )
)
pivot (
    max(motherid) for rn in (1 as m1,2 as m2,3 as m3,4 as m4,5 as m5,6 as m6,7 as m7)
)
)
;


UNIQUEMOTHERGROUPS
------------------
                 3