Randomly select rows in BigQuery table for each ID-CodePudding

I have a Biguery table consisting of multiple entries for each ID for each day. Basically, the IDs are stores with a list of products for which 2 columns represent properties.

   Store Product       Date  property1  property2
0    ID1      A1  202212-01          1          5
1    ID1      A1  202212-02          2          6
2    ID1      A1  202212-03          3          7
3    ID1      A1  202212-04          4          8
4    ID1      A1  202212-05          5          9
5    ID1      A1  202212-06          6         10
6    ID1      A1  202212-07          7         11
7    ID1      A1  202212-08          8         12
8    ID1      A1  202212-09          9         13
9    ID1      A1  202212-10         10         14
10   ID1      A2  202212-01         11         15
11   ID1      A2  202212-02         12         16
12   ID1      A2  202212-03         13         17
13   ID1      A2  202212-04         14         18
14   ID1      A2  202212-05         15         19
15   ID1      A2  202212-06         16         20
16   ID1      A2  202212-07         17         21
17   ID1      A2  202212-08         18         22
18   ID1      A2  202212-09         19         23
19   ID1      A2  202212-10         20         24
20   ID2      B1  202212-01         21         25
21   ID2      B1  202212-02         22         26
22   ID2      B1  202212-03         23         27
23   ID2      B1  202212-04         24         28
24   ID2      B1  202212-05         25         29
25   ID2      B1  202212-06         26         30
26   ID2      B1  202212-07         27         31
27   ID2      B1  202212-08         28         32
28   ID2      B1  202212-09         29         33
29   ID2      B1  202212-10         30         34
30   ID2      B2  202212-01         31         35
31   ID2      B2  202212-02         32         36
32   ID2      B2  202212-03         33         37
33   ID2      B2  202212-04         34         38
34   ID2      B2  202212-05         35         39
35   ID2      B2  202212-06         36         40
36   ID2      B2  202212-07         37         41
37   ID2      B2  202212-08         38         42
38   ID2      B2  202212-09         39         43
39   ID2      B2  202212-10         40         44

Now, the real table consists of more than a billion rows, so I want to take a random sample consisting of a sample of product for the last day of entry but it needs to be from ALL stores.

I tried the following approach:

Since I want the last date of entry I use a with clause to limit to the last date (max(DATE(product_timestamp))) and list all the stores with another with clause on stores. I then take the random sample:

query_random_sample = """ 
with maxdate as (select max(DATE(product_timestamp)) as maxdate from `MyProject.DataSet1.product_timeline`)
,
stores as (select store from `MyProject.DataSet1.stores`)

select t.*,
t2.ProductDescription,
t2.ProductName,
t2.CreatedDate,

from (`MyProject.DataSet1.product_timeline` as t
          join `MyProject.DataSet2.LableStore` as t2
          on t.store = t2.store
          and t.barcode = t2.barcode
          join maxdate 
          on maxdate.maxdate = DATE(t.product_timestamp)
          )  
          join stores 
          on stores.store = t.store
where rand()< 0.01

"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
    ]
)

sampled_labels = bigquery_client.query(query_random_sample, job_config=job_config).to_dataframe()

The problem is that it even samples on store, but I want the sample to be on product for each store.

I work in Python and an alternative would be to do the query for each store, but the cost of such a query would be huge (over 1200 stores).

How can I solve this is a cost efficient way.

CodePudding user response：

If I'm right to assume you want a random sample specific to each store, then I think your best bet is using a window function to do your random selection, using a window partitioned by Store:

    SELECT
        Store,
        Product,
        Date,
        property1,
        property2,
      FROM
       `MyProject.DataSet1.product_timeline`
   QUALIFY
        PERCENT_RANK() OVER(all_stores_rand) < 0.01
    WINDOW
        all_stores_rand AS (
          PARTITION BY Store
          ORDER BY RAND()
        )

To explain that, we are partitioning the table into one group per value of Store (analogous to what we'd do for a GROUP BY), then calculating PERCENT_RANK over a set of random numbers separately for each store (generating these numbers using RAND()).

Since the part of the table corresponding to each Store must then yield a set of values evenly spanning 0 to 1, we can throw this into a QUALIFY (BigQuery's filter clause for window expressions) in order to just grab 1% of the values for each Store.