Sample rows by group from one table based on size info from another in SQL Server/Vertica-CodePudding

I need to sample N rows from one table, based on size info (by group) from another table.

I wish to automate the sampling process as both the sample sizes and groups change often. Writing the sampling manually (see below) is inefficient.

My input tables are:

#PERSON_LIST - List of names and their group (here called Team A, B or C) from which I have to randomly sample N rows

GROUP_NAME	PERSON_NAME
Team A	Bruce
Team A	Barbara
Team A	Alfred
...	...
Team B	Erik
...	...
Team C	Tony

#SAMPLE_INFO - Number of names by group - or sample size (N rows) that I have to select from #PERSON_LIST

GROUP_NAME	REQ_SAMPLE_COUNT
Team A	5
Team B	3
Team C	2

What I Tried

One option is to explicitly sample from each group. Since in my case, the list of names and groups (sizes) change often, this is not a viable solution in the long term.

Is there a way to write this more algorithmically or should I use Python instead?

Solution in either MS SQL or Vertica are fine.

SELECT * FROM (SELECT TOP 5 * FROM #PERSON_LIST WHERE GROUP_NAME = 'Team A' ORDER BY NEWID()) AS T1
UNION
SELECT * FROM (SELECT TOP 3 * FROM #PERSON_LIST WHERE GROUP_NAME = 'Team B' ORDER BY NEWID()) AS T2
UNION
SELECT * FROM (SELECT TOP 2 * FROM #PERSON_LIST WHERE GROUP_NAME = 'Team C' ORDER BY NEWID()) AS T3
;

Sample Code for Tables (SQL Server)

CREATE TABLE #SAMPLE_INFO (
    GROUP_NAME         VARCHAR(128),
    REQ_SAMPLE_COUNT   INT,
);


INSERT INTO #SAMPLE_INFO (GROUP_NAME, REQ_SAMPLE_COUNT)
VALUES ('Team A', 5), ('Team B', 3), ('Team C', 2);


CREATE TABLE #PERSON_LIST (
    GROUP_NAME         VARCHAR(128),
    PERSON_NAME        VARCHAR(128)
);


INSERT INTO #PERSON_LIST (GROUP_NAME, PERSON_NAME)
VALUES ('Team A', 'Bruce'), ('Team A', 'Barbara'), ('Team A', 'Alfred'), ('Team A', 'Jason'), ('Team A', 'Lucius'),
       ('Team A', 'Harvey'), ('Team A', 'Selina'), ('Team A', 'Victor'), ('Team A', 'Pamela'), ('Team A', 'Edward'),
       ('Team B', 'Erik'), ('Team B', 'Charles'), ('Team B', 'Logan'), ('Team B', 'Hank'), ('Team B', 'Scott'),
       ('Team C', 'Tony'), ('Team C', 'Thor'), ('Team C', 'Henry'), ('Team C', 'Steve'), ('Team C', 'Nick');


-- SELECT * FROM #SAMPLE_INFO;
-- SELECT * FROM #PERSON_LIST;

CodePudding user response：

This would SELECT the number REQ_SAMPLE_COUNT of Person from every Group

the selected names are in this case random as you have no order in your personlist table, that can be applied

So the sample result as shown here, will not be the same, when you run it

WITH CTE AS
(SELECT
p.GROUP_NAME, p.PERSON_NAME,REQ_SAMPLE_COUNT,
  ROW_NUMBER() OVER(PARTITION  BY p.GROUP_NAME ORDER BY NEWID()) as rn
FROM #PERSON_LIST p JOIN #SAMPLE_INFO s ON s.GROUP_NAME = p.GROUP_NAME)
SELECT GROUP_NAME, PERSON_NAME FROM CTE WHERE rn <= REQ_SAMPLE_COUNT

GROUP_NAME	PERSON_NAME
Team A	Jason
Team A	Victor
Team A	Barbara
Team A	Pamela
Team A	Lucius
Team B	Hank
Team B	Logan
Team B	Erik
Team C	Steve
Team C	Thor

fiddle