Home > OS >  Taking N-samples from each group in PostgreSQL
Taking N-samples from each group in PostgreSQL

Time:09-22

I have a table containing data that has a column named id that looks like below:

id value 1 value 2 value 3
1 244 550 1000
1 251 551 700
1 540 60 1200
... ... ... ...
2 19 744 2000
2 10 903 100
2 44 231 600
2 120 910 1100
... ... ... ...

I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points.

For example I would like a maximum 50 data points randomly selected from id = 1, id = 2 etc...

I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id and limit to 50:

SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;

But it's obvious that you cannot use this type of solution because UNION ALL requires aggregating outputs from one id to the next and I do not have a list of id values to use in place of X in tbl.id = X.

Is there a way to accomplish this by gathering that list of unique id values and union all results or is there a more optimal way this could be done?

CodePudding user response:

If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it:

select * from (
  select *, row_number() over (partition by id order by random()) as u
  from schema.table
) as a
where u <= 50;

Example (limiting to 3, and some row number for each id so you can see the selection randomness):

  1. setup
DROP TABLE IF EXISTS foo;

CREATE TABLE foo
(
  id int,
  value1 int,
  idrow int
  );

INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;
  1. Selection
select * from (
  select *, row_number() over (partition by id order by random()) as u
  from foo
) as a
where u <= 3;

Output:

id value1 idrow u
1 542 6 1
1 24 86 2
1 155 74 3
2 505 95 1
2 100 46 2
2 422 33 3
3 966 88 1
3 747 89 2
3 664 19 3

CodePudding user response:

In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -

From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."

Query -

with data as (
select row_number() over (partition by id order by random()) rn,
* from table_name)
select * from data where rn<=50 order by id;

Fiddle.

CodePudding user response:

Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):

with uniq as (select distinct id from test)
select foo.* from uniq cross join lateral 
   (select * from test where test.id=uniq.id order by random() limit 3) foo

This could be either slower or faster than the Window Function method, depending on your system and your data and your indexes. In my hands, it was quite a bit faster even with the need to dynamically compute the list of distinct ids.

  • Related