I need to sample N
rows from one table, based on size info (by group) from another table.
I wish to automate the sampling process as both the sample sizes and groups change often. Writing the sampling manually (see below) is inefficient.
My input tables are:
#PERSON_LIST - List of names and their group (here called Team A, B or C) from which I have to randomly sample N
rows
GROUP_NAME | PERSON_NAME |
---|---|
Team A | Bruce |
Team A | Barbara |
Team A | Alfred |
... | ... |
Team B | Erik |
... | ... |
Team C | Tony |
#SAMPLE_INFO - Number of names by group - or sample size (N
rows) that I have to select from #PERSON_LIST
GROUP_NAME | REQ_SAMPLE_COUNT |
---|---|
Team A | 5 |
Team B | 3 |
Team C | 2 |
What I Tried
One option is to explicitly sample from each group. Since in my case, the list of names and groups (sizes) change often, this is not a viable solution in the long term.
Is there a way to write this more algorithmically or should I use Python instead?
Solution in either MS SQL or Vertica are fine.
SELECT * FROM (SELECT TOP 5 * FROM #PERSON_LIST WHERE GROUP_NAME = 'Team A' ORDER BY NEWID()) AS T1
UNION
SELECT * FROM (SELECT TOP 3 * FROM #PERSON_LIST WHERE GROUP_NAME = 'Team B' ORDER BY NEWID()) AS T2
UNION
SELECT * FROM (SELECT TOP 2 * FROM #PERSON_LIST WHERE GROUP_NAME = 'Team C' ORDER BY NEWID()) AS T3
;
Sample Code for Tables (SQL Server)
CREATE TABLE #SAMPLE_INFO (
GROUP_NAME VARCHAR(128),
REQ_SAMPLE_COUNT INT,
);
INSERT INTO #SAMPLE_INFO (GROUP_NAME, REQ_SAMPLE_COUNT)
VALUES ('Team A', 5), ('Team B', 3), ('Team C', 2);
CREATE TABLE #PERSON_LIST (
GROUP_NAME VARCHAR(128),
PERSON_NAME VARCHAR(128)
);
INSERT INTO #PERSON_LIST (GROUP_NAME, PERSON_NAME)
VALUES ('Team A', 'Bruce'), ('Team A', 'Barbara'), ('Team A', 'Alfred'), ('Team A', 'Jason'), ('Team A', 'Lucius'),
('Team A', 'Harvey'), ('Team A', 'Selina'), ('Team A', 'Victor'), ('Team A', 'Pamela'), ('Team A', 'Edward'),
('Team B', 'Erik'), ('Team B', 'Charles'), ('Team B', 'Logan'), ('Team B', 'Hank'), ('Team B', 'Scott'),
('Team C', 'Tony'), ('Team C', 'Thor'), ('Team C', 'Henry'), ('Team C', 'Steve'), ('Team C', 'Nick');
-- SELECT * FROM #SAMPLE_INFO;
-- SELECT * FROM #PERSON_LIST;
CodePudding user response:
This would SELECT the number REQ_SAMPLE_COUNT of Person from every Group
the selected names are in this case random as you have no order in your personlist table, that can be applied
So the sample result as shown here, will not be the same, when you run it
WITH CTE AS
(SELECT
p.GROUP_NAME, p.PERSON_NAME,REQ_SAMPLE_COUNT,
ROW_NUMBER() OVER(PARTITION BY p.GROUP_NAME ORDER BY NEWID()) as rn
FROM #PERSON_LIST p JOIN #SAMPLE_INFO s ON s.GROUP_NAME = p.GROUP_NAME)
SELECT GROUP_NAME, PERSON_NAME FROM CTE WHERE rn <= REQ_SAMPLE_COUNT
GROUP_NAME | PERSON_NAME |
---|---|
Team A | Jason |
Team A | Victor |
Team A | Barbara |
Team A | Pamela |
Team A | Lucius |
Team B | Hank |
Team B | Logan |
Team B | Erik |
Team C | Steve |
Team C | Thor |