efficient way to do stratified sampling in big query-CodePudding

I have a Google Big Query table called TableA that has ~3M records. There is a column called DimA (Dimension A) that has 20 values - 1 to 20. The counts by each value of DimA is shown in the summary table below in the Total column. I did some analysis and determined how much random sample I should draw from each value of DimA and it is shown in the column Sample. The % of sample drawn by each value of DimA is shown in column DimA_value_perc. I know how to do sample via brute force using the code below the table. However, this code is not scalable as the number of values of DimA grows and in case there are additional dimensions. Is there a more efficient way to do the stratified sampling? Thanks.

DimA	Total	Sample	DimA_value_perc
1	115,623	3,077	3%
2	108,203	3,943	4%
3	153,477	6,802	4%
4	232,252	12,426	5%
5	223,004	14,052	6%
6	242,386	17,589	7%
7	121,519	9,783	8%
8	371,342	34,026	9%
9	147,683	15,400	10%
10	281,101	32,775	12%
11	93,380	12,075	13%
12	181,293	25,675	14%
13	122,206	19,344	16%
14	140,559	25,141	18%
15	95,576	19,498	20%
16	94,319	21,969	23%
17	108,282	30,054	28%
18	94,920	33,228	35%
19	82,764	39,700	48%
20	28,417	23,442	82%
Grand Total	3,038,306	400,000

SELECT *
FROM tableA
where DimA = 1
order by rand()
limit  3077 

union all

SELECT *
FROM tableA
where DimA = 2
order by rand()
limit   3943 

etc

CodePudding user response：

Consider below approach

with samples as (
  select 1 DimA, 115623 Total, 3077 Sample, 3 DimA_value_perc union all
  select 2, 108203, 3943, 4 union all
  select 3, 153477, 6802, 4 union all
  select 4, 232252, 12426, 5 union all
  select 5, 223004, 14052, 6 union all
  select 6, 242386, 17589, 7 union all
  select 7, 121519, 9783, 8 union all
  select 8, 371342, 34026, 9 union all
  select 9, 147683, 15400, 10 union all
  select 10, 281101, 32775, 12 union all
  select 11, 93380, 12075, 13 union all
  select 12, 181293, 25675, 14 union all
  select 13, 122206, 19344, 16 union all
  select 14, 140559, 25141, 18 union all
  select 15, 95576, 19498, 20 union all
  select 16, 94319, 21969, 23 union all
  select 17, 108282, 30054, 28 union all
  select 18, 94920, 33228, 35 union all
  select 19, 82764, 39700, 48 union all
  select 20, 28417, 23442, 82 
)
select a.* except(pos) from (
  select *, row_number() over(partition by DimA order by rand()) pos
  from tableA
) a
join samples 
using(DimA)
where pos <= Sample