I have a table containing data that has a column named id
that looks like below:
id | value 1 | value 2 | value 3 |
---|---|---|---|
1 | 244 | 550 | 1000 |
1 | 251 | 551 | 700 |
1 | 540 | 60 | 1200 |
... | ... | ... | ... |
2 | 19 | 744 | 2000 |
2 | 10 | 903 | 100 |
2 | 44 | 231 | 600 |
2 | 120 | 910 | 1100 |
... | ... | ... | ... |
I want to take 50 sample rows per id
that exists but if less than 50 exist for the group to simply take the entire set of data points.
For example I would like a maximum 50 data points randomly selected from id = 1
, id = 2
etc...
I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id
and limit to 50:
SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;
But it's obvious that you cannot use this type of solution because UNION ALL
requires aggregating outputs from one id to the next and I do not have a list of id
values to use in place of X in tbl.id = X
.
Is there a way to accomplish this by gathering that list of unique id
values and union all results or is there a more optimal way this could be done?
CodePudding user response:
If you want to select a random sample for each id
, then you need to randomize the rows somehow. Here is a way to do it:
select * from (
select *, row_number() over (partition by id order by random()) as u
from schema.table
) as a
where u <= 50;
Example (limiting to 3, and some row number for each id
so you can see the selection randomness):
- setup
DROP TABLE IF EXISTS foo;
CREATE TABLE foo
(
id int,
value1 int,
idrow int
);
INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;
- Selection
select * from (
select *, row_number() over (partition by id order by random()) as u
from foo
) as a
where u <= 3;
Output:
id | value1 | idrow | u |
---|---|---|---|
1 | 542 | 6 | 1 |
1 | 24 | 86 | 2 |
1 | 155 | 74 | 3 |
2 | 505 | 95 | 1 |
2 | 100 | 46 | 2 |
2 | 422 | 33 | 3 |
3 | 966 | 88 | 1 |
3 | 747 | 89 | 2 |
3 | 664 | 19 | 3 |
CodePudding user response:
In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -
From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."
Query -
with data as (
select row_number() over (partition by id order by random()) rn,
* from table_name)
select * from data where rn<=50 order by id;
CodePudding user response:
Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):
with uniq as (select distinct id from test)
select foo.* from uniq cross join lateral
(select * from test where test.id=uniq.id order by random() limit 3) foo
This could be either slower or faster than the Window Function method, depending on your system and your data and your indexes. In my hands, it was quite a bit faster even with the need to dynamically compute the list of distinct ids.