I need to select the count of unique value combinations of column B in an XRef table which is grouped by column A.
Consider the following schema and data, which represents a simple family structure. Each child has a father and mother:
TABLE Father
FatherID | Name |
---|---|
1 | Alex |
2 | Bob |
TABLE Mother
MotherID | Name |
---|---|
1 | Alice |
2 | Barbara |
TABLE Child
ChildID | FatherID | MotherID | Name |
---|---|---|---|
1 | 1 (Alex) | 1 (Alice) | Adam |
2 | 1 (Alex) | 1 (Alice) | Billy |
3 | 1 (Alex) | 2 (Barbara) | Celine |
4 | 2 (Bob) | 2 (Barbara) | Derek |
The distinct combinations of mothers for each father are:
- Alex (Alice, Barbara)
- Bob (Barbara)
In all there are two distinct combinations of mothers:
- Alice, Barbara
- Barbara
The query I want to write would return the count of those distinct combinations of mother, regardless of which father they are associated with:
UniqueMotherGroups |
---|
2 |
I was able to do this successfully using the STRING_AGG function, but it feels clunky. It also needs to operate over millions of rows and is quite slow at the moment. Is there a more idiomatic way to do this with set operations instead?
Here is my working example:
-- Drop pre-existing tables
DROP TABLE IF EXISTS dbo.Child;
DROP TABLE IF EXISTS dbo.Father;
DROP TABLE IF EXISTS dbo.Mother;
-- Create family tables.
CREATE TABLE dbo.Father
(
FatherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
PRIMARY KEY CLUSTERED (FatherID);
ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Mother
(
MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
PRIMARY KEY CLUSTERED (MotherID);
ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Child
(
ChildID INT NOT NULL
, FatherID INT NOT NULL
, MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
PRIMARY KEY CLUSTERED (ChildID);
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
FOREIGN KEY (FatherID)
REFERENCES dbo.Father (FatherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
FOREIGN KEY (MotherID)
REFERENCES dbo.Mother (MotherID);
-- Insert two children with the same parents
INSERT INTO dbo.Father
(
FatherID
, Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')
INSERT INTO dbo.Mother
(
MotherID
, Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');
INSERT INTO dbo.Child
(
ChildID
, FatherID
, MotherID
, Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');
-- CTE Gets distinct combinations of parents
WITH distinctParentCombinations (FatherID, MotherID)
AS (SELECT children.FatherID
, children.MotherID
FROM dbo.Child as children
GROUP BY children.FatherID
, children.MotherID
)
-- CTE Gets uses STRING_AGG to get unique combinations of mothers.
, motherGroups (Mothers)
AS (SELECT STRING_AGG(CONVERT(VARCHAR(MAX), distinctParentCombinations.MotherID), '-') WITHIN GROUP (ORDER BY distinctParentCombinations.MotherID) AS Mothers
FROM distinctParentCombinations
GROUP BY distinctParentCombinations.FatherID
)
-- Remove the COUNT function to see the actual combinations
SELECT COUNT(motherGroups.Mothers) AS UniqueMotherGroups
FROM motherGroups
-- Clean up the example
DROP TABLE IF EXISTS dbo.Child;
DROP TABLE IF EXISTS dbo.Father;
DROP TABLE IF EXISTS dbo.Mother;
CodePudding user response:
Thank you for posting such a comprehensive setup for the test data. However, I'm not running any CREATE
/DROP
statements against my DB so I converted those tables into table variables. Using your data, I came up with the following query. Just change the table names back to your dbo. names and you should be able to test in your environment. I basically concatenate every father/mother combo into a text string using FOR XML PATH
. Then I count up all the distinct combos. If you find error in my logic, let me know. I'm just tossing this in the ring of possible solutions.
WITH distinctCombos AS (
SELECT DISTINCT
c.FatherID, c.MotherID
FROM @Child as c
) , motherComboCount AS (
SELECT
f.FatherID
, f.[Name]
, STUFF((
SELECT
',' CAST(dc.MotherID as nvarchar)
FROM distinctCombos as dc
WHERE dc.FatherID = f.FatherID
ORDER BY dc.MotherID ASC
FOR XML PATH('')
),1,1,'') as motherList
FROM @Father as f
)
SELECT
COUNT(DISTINCT motherList) as UniqueMotherGroups
FROM motherComboCount as mcc
To save a bit of compute power, remove the STUFF
function as it's not necessary for the comparison... it just makes the list nicer to look at if displaying... and I'm in the habit of using it.
It looks like the main differences between our methods is the use of FOR XML PATH
vs STRING_AGG
(I'm still on older SQL.) And I use DISTINCT
twice instead of GROUP BY
. If you have a larger dataset to test against, let me know how the 2 methods compare. I'm trying to think of a completely set-based method but I can't see it at the moment.
Update: Method 2. Here's an idea I had using recursive CTEs to build the distinct mother combinations. In your example data, there are only 2 mothers per father. So there would be a total of 4 set-based queries performed (first CTE, 2 queries in the recursive CTE and the final SELECT).
WITH uniqueCombo as (
SELECT DISTINCT
c.FatherID
, c.MotherID
, ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.MotherID) as row_num
FROM @Child as c
), combos as (
SELECT
uc.FatherID
, uc.MotherID
, CAST(uc.MotherID as nvarchar(max)) as [path]
, row_num
, 0 as hierarchy_num
FROM uniqueCombo as uc
WHERE uc.row_num = 1
UNION ALL
SELECT
uc.FatherID
, uc.MotherID
, co.[path] ',' CAST(uc.MotherID as nvarchar(max))
, uc.row_num
, co.hierarchy_num 1 as heirarchy_num
FROM uniqueCombo as uc
INNER JOIN combos as co
ON co.FatherID = uc.FatherID
--AND co.MotherID <> uc.MotherID
AND co.row_num 1 = uc.row_num
), rankedCombos as (
SELECT
c.[path]
, ROW_NUMBER() OVER(PARTITION BY c.FatherID ORDER BY c.hierarchy_num DESC) as row_num
FROM combos as c
)
SELECT COUNT(DISTINCT rc.[path]) as UniqueMotherGroups
FROM rankedCombos as rc
WHERE rc.row_num = 1
Update 2:
I had another idea to use a PIVOT
to transpose the records so that the FatherID would be in the left-most column with the MotherIDs as the column headers. To make that work with a dynamic list of MotherIDs, you have to use a dynamic PIVOT/dynamic SQL. (FatherID isn't really needed in the PIVOT
so it's not included in the PIVOT query. I just had to describe what the goal is.) After the pivot, you can SELECT DISTINCT
to get the unique mother combinations. Then the last SELECT
is to get the COUNT
. This one I ran in SQL Fiddle:
MS SQL Server 2017 Schema Setup:
-- Create family tables.
CREATE TABLE dbo.Father
(
FatherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Father
ADD CONSTRAINT PK_Father
PRIMARY KEY CLUSTERED (FatherID);
ALTER TABLE dbo.Father SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Mother
(
MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Mother
ADD CONSTRAINT PK_Mother
PRIMARY KEY CLUSTERED (MotherID);
ALTER TABLE dbo.Mother SET (LOCK_ESCALATION = TABLE);
CREATE TABLE dbo.Child
(
ChildID INT NOT NULL
, FatherID INT NOT NULL
, MotherID INT NOT NULL
, Name VARCHAR(50) NOT NULL
);
ALTER TABLE dbo.Child
ADD CONSTRAINT PK_Child
PRIMARY KEY CLUSTERED (ChildID);
CREATE NONCLUSTERED INDEX IX_Parents ON dbo.Child (FatherID, MotherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Father
FOREIGN KEY (FatherID)
REFERENCES dbo.Father (FatherID);
ALTER TABLE dbo.Child
ADD CONSTRAINT FK_Child_Mother
FOREIGN KEY (MotherID)
REFERENCES dbo.Mother (MotherID);
-- Insert two children with the same parents
INSERT INTO dbo.Father
(
FatherID
, Name
)
VALUES
(1, 'Alex')
, (2, 'Bob')
, (3, 'Charlie')
INSERT INTO dbo.Mother
(
MotherID
, Name
)
VALUES
(1, 'Alice')
, (2, 'Barbara');
INSERT INTO dbo.Child
(
ChildID
, FatherID
, MotherID
, Name
)
VALUES
(1, 1, 1, 'Adam')
, (2, 1, 1, 'Billy')
, (3, 1, 2, 'Celine')
, (4, 2, 2, 'Derek')
, (5, 3, 1, 'Eric');
Query 1:
DECLARE @cols AS nvarchar(MAX)
DECLARE @query AS nvarchar(MAX)
SET @cols = STUFF((
SELECT DISTINCT ',' QUOTENAME(m.MotherID)
FROM Mother as m
FOR XML PATH(''))
,1,1,'')
SET @query = 'SELECT COUNT(mCount) as UniqueMotherGroups FROM (
SELECT DISTINCT ' @cols ', 1 as mCount FROM (
SELECT ' @cols '
FROM (
SELECT
c.FatherID
, c.MotherID
, 1 as mID
FROM child as c
) x
PIVOT
(
MAX(mID)
FOR MotherID in (' @cols ')
) p
) as m
) as mg'
--SELECT @query
Exec(@query)
| UniqueMotherGroups |
|--------------------|
| 3 |
UPDATE 3: Here's one other idea... create a results table with a unique constraint and with IGNORE_DUP_KEY=ON. You could use this in a function or stored procedure, or, setup a trigger to put the mother combinations into a unique-combo-holding-table. With IGNORE_DUP_KEY=ON, you can insert every combo and only the unique combos will remain. Then just do a count of all the rows.
--Create a table to hold the results:
CREATE TABLE results (
ChildID int not null
, UniqueCombos nvarchar(50) not null
PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)
);
--Insert all combos into the results table. The unique constraint will cause only unique entries to remain.
INSERT INTO results (ChildID, UniqueCombos)
SELECT DISTINCT
c.ChildID
, (
SELECT ',' CAST(MotherID as nvarchar(500))
FROM Child as c2
WHERE c2.ChildID = c.ChildID
ORDER BY c2.MotherID
FOR XML PATH('')
) as mother_combos
FROM Child as c
;
--Count up all the rows in the results table. Since these are all unique combinations, it should be fast to sum.
SELECT COUNT(*)
FROM results;
CodePudding user response:
If you accept to define a maximum number of mothers per father (here 7) you may try:
select count(*) as UniqueMotherGroups from (
select distinct m1, m2, m3, m4, m5, m6, m7 from (
select FatherID, row_number() over(partition by FatherID order by motherid) as rn, motherid
from (
select distinct FatherID, MotherID
from t_Child
)
)
pivot (
max(motherid) for rn in (1 as m1,2 as m2,3 as m3,4 as m4,5 as m5,6 as m6,7 as m7)
)
)
;
UNIQUEMOTHERGROUPS
------------------
3