I have a Google Big Query table called TableA that has ~3M records. There is a column called DimA (Dimension A) that has 20 values - 1 to 20. The counts by each value of DimA is shown in the summary table below in the Total column. I did some analysis and determined how much random sample I should draw from each value of DimA and it is shown in the column Sample. The % of sample drawn by each value of DimA is shown in column DimA_value_perc. I know how to do sample via brute force using the code below the table. However, this code is not scalable as the number of values of DimA grows and in case there are additional dimensions. Is there a more efficient way to do the stratified sampling? Thanks.
DimA | Total | Sample | DimA_value_perc |
---|---|---|---|
1 | 115,623 | 3,077 | 3% |
2 | 108,203 | 3,943 | 4% |
3 | 153,477 | 6,802 | 4% |
4 | 232,252 | 12,426 | 5% |
5 | 223,004 | 14,052 | 6% |
6 | 242,386 | 17,589 | 7% |
7 | 121,519 | 9,783 | 8% |
8 | 371,342 | 34,026 | 9% |
9 | 147,683 | 15,400 | 10% |
10 | 281,101 | 32,775 | 12% |
11 | 93,380 | 12,075 | 13% |
12 | 181,293 | 25,675 | 14% |
13 | 122,206 | 19,344 | 16% |
14 | 140,559 | 25,141 | 18% |
15 | 95,576 | 19,498 | 20% |
16 | 94,319 | 21,969 | 23% |
17 | 108,282 | 30,054 | 28% |
18 | 94,920 | 33,228 | 35% |
19 | 82,764 | 39,700 | 48% |
20 | 28,417 | 23,442 | 82% |
Grand Total | 3,038,306 | 400,000 |
SELECT *
FROM tableA
where DimA = 1
order by rand()
limit 3077
union all
SELECT *
FROM tableA
where DimA = 2
order by rand()
limit 3943
etc
CodePudding user response:
Consider below approach
with samples as (
select 1 DimA, 115623 Total, 3077 Sample, 3 DimA_value_perc union all
select 2, 108203, 3943, 4 union all
select 3, 153477, 6802, 4 union all
select 4, 232252, 12426, 5 union all
select 5, 223004, 14052, 6 union all
select 6, 242386, 17589, 7 union all
select 7, 121519, 9783, 8 union all
select 8, 371342, 34026, 9 union all
select 9, 147683, 15400, 10 union all
select 10, 281101, 32775, 12 union all
select 11, 93380, 12075, 13 union all
select 12, 181293, 25675, 14 union all
select 13, 122206, 19344, 16 union all
select 14, 140559, 25141, 18 union all
select 15, 95576, 19498, 20 union all
select 16, 94319, 21969, 23 union all
select 17, 108282, 30054, 28 union all
select 18, 94920, 33228, 35 union all
select 19, 82764, 39700, 48 union all
select 20, 28417, 23442, 82
)
select a.* except(pos) from (
select *, row_number() over(partition by DimA order by rand()) pos
from tableA
) a
join samples
using(DimA)
where pos <= Sample