Home > Mobile >  SQL - Sample and Filter by the grouping variable
SQL - Sample and Filter by the grouping variable

Time:12-13

I have a table that looks like the following

V1 |  Group  |  time    
----------------------
a  |     g1  |   1      
b  |     g1  |   2      
c  |     g1  |   3      
d  |     g2  |   1      
e  |     g2  |   2      
f  |     g2  |   3      
g  |     g3  |   5
h  |     g3  |   6
i  |     g3  |   7

I wish to filter only a sample Groups, but take all of the associated examples with the filtered groups. For example, assuming that the desired sample size is 2 groups, and (g1,g3) were sampled, the desired table will be

V1 |  Group  |  time    
----------------------
a  |     g1  |   1      
b  |     g1  |   2      
c  |     g1  |   3 
g  |     g3  |   5
h  |     g3  |   6
i  |     g3  |   7

CodePudding user response:

I think this is what you want.

Note that I renamed group to g because it is usually a reserved SQL Keyword and that you may have to use another rand function depending on your DB (I assume SQL Server here).

drop table if exists #have;

CREATE TABLE #have 
(
  V1    VARCHAR(3) NOT NULL
, g     VARCHAR(3)       
, time  INT
);


insert into #have
values
  ('a', 'g1', 1)
, ('b', 'g1', 2)
, ('c', 'g1', 3)
, ('d', 'g2', 1)
, ('e', 'g2', 2)
, ('f', 'g2', 3)
, ('g', 'g3', 5)
, ('h', 'g3', 6)
, ('i', 'g3', 7)
;

select a.*
from #have a join
(
select top 2 g
from #have
group by g
order by newid()
) b
on a.g = b.g;

CodePudding user response:

A simple query like this would give you the results you want:

SELECT * FROM table WHERE Group IN('g1','g3');

Alternative:

SELECT * FROM table WHERE Group = 'g1' OR Group = 'g3';

Note GROUP is a reserved word in most RDBMSs so you probably need to escape them. In MySQL/MariaDB you can use backticks:

SELECT * FROM `table` WHERE `Group` IN('g1','g3');

If ANSI SQL mode is enabled you can use double quotes :

SELECT * FROM "table" WHERE "Group" IN('g1','g3');

In T-SQL you can use brackets:

SELECT * FROM [table] WHERE [Group] IN('g1','g3');

If you want a randomized sample of two groups, and use SQL Server you should go with PeterClemmensens excellent answer

With a few changes in syntax his solution works well also for MySQL/MariaDB 5.5 and above:

SELECT a.*
FROM `table` a JOIN (
     SELECT `Group`
     FROM `table`
     GROUP BY `Group`
     ORDER BY RAND()
     LIMIT 2
) b
ON a.`Group` = b.`Group`;
  • Related