Taking N-samples from each group in PostgreSQL-CodePudding

I have a table containing data that has a column named id that looks like below:

id	value 1	value 2	value 3
1	244	550	1000
1	251	551	700
1	540	60	1200
...	...	...	...
2	19	744	2000
2	10	903	100
2	44	231	600
2	120	910	1100
...	...	...	...

I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points.

For example I would like a maximum 50 data points randomly selected from id = 1, id = 2 etc...

I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id and limit to 50:

SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;

But it's obvious that you cannot use this type of solution because UNION ALL requires aggregating outputs from one id to the next and I do not have a list of id values to use in place of X in tbl.id = X.

Is there a way to accomplish this by gathering that list of unique id values and union all results or is there a more optimal way this could be done?

CodePudding user response：

If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it:

select * from (
  select *, row_number() over (partition by id order by random()) as u
  from schema.table
) as a
where u <= 50;

Example (limiting to 3, and some row number for each id so you can see the selection randomness):

setup

DROP TABLE IF EXISTS foo;

CREATE TABLE foo
(
  id int,
  value1 int,
  idrow int
  );

INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;

Selection

select * from (
  select *, row_number() over (partition by id order by random()) as u
  from foo
) as a
where u <= 3;

Output:

id	value1	idrow	u
1	542	6	1
1	24	86	2
1	155	74	3
2	505	95	1
2	100	46	2
2	422	33	3
3	966	88	1
3	747	89	2
3	664	19	3

CodePudding user response：

In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -

From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."

Query -

with data as (
select row_number() over (partition by id order by random()) rn,
* from table_name)
select * from data where rn<=50 order by id;

Fiddle.

CodePudding user response：

Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):

with uniq as (select distinct id from test)
select foo.* from uniq cross join lateral 
   (select * from test where test.id=uniq.id order by random() limit 3) foo

This could be either slower or faster than the Window Function method, depending on your system and your data and your indexes. In my hands, it was quite a bit faster even with the need to dynamically compute the list of distinct ids.

id	value1	idrow	u
1	542	6	1
1	24	86	2
1	155	74	3
2	505	95	1
2	100	46	2
2	422	33	3
3	966	88	1
3	747	89	2
3	664	19	3

id	value1	idrow	u
1	542	6	1
1	24	86	2
1	155	74	3
2	505	95	1
2	100	46	2
2	422	33	3
3	966	88	1
3	747	89	2
3	664	19	3

id	value1	idrow	u
1	542	6	1
1	24	86	2
1	155	74	3
2	505	95	1
2	100	46	2
2	422	33	3
3	966	88	1
3	747	89	2
3	664	19	3