I have a Biguery table consisting of multiple entries for each ID
for each day. Basically, the IDs are stores with a list of products for which 2 columns represent properties.
Store Product Date property1 property2
0 ID1 A1 202212-01 1 5
1 ID1 A1 202212-02 2 6
2 ID1 A1 202212-03 3 7
3 ID1 A1 202212-04 4 8
4 ID1 A1 202212-05 5 9
5 ID1 A1 202212-06 6 10
6 ID1 A1 202212-07 7 11
7 ID1 A1 202212-08 8 12
8 ID1 A1 202212-09 9 13
9 ID1 A1 202212-10 10 14
10 ID1 A2 202212-01 11 15
11 ID1 A2 202212-02 12 16
12 ID1 A2 202212-03 13 17
13 ID1 A2 202212-04 14 18
14 ID1 A2 202212-05 15 19
15 ID1 A2 202212-06 16 20
16 ID1 A2 202212-07 17 21
17 ID1 A2 202212-08 18 22
18 ID1 A2 202212-09 19 23
19 ID1 A2 202212-10 20 24
20 ID2 B1 202212-01 21 25
21 ID2 B1 202212-02 22 26
22 ID2 B1 202212-03 23 27
23 ID2 B1 202212-04 24 28
24 ID2 B1 202212-05 25 29
25 ID2 B1 202212-06 26 30
26 ID2 B1 202212-07 27 31
27 ID2 B1 202212-08 28 32
28 ID2 B1 202212-09 29 33
29 ID2 B1 202212-10 30 34
30 ID2 B2 202212-01 31 35
31 ID2 B2 202212-02 32 36
32 ID2 B2 202212-03 33 37
33 ID2 B2 202212-04 34 38
34 ID2 B2 202212-05 35 39
35 ID2 B2 202212-06 36 40
36 ID2 B2 202212-07 37 41
37 ID2 B2 202212-08 38 42
38 ID2 B2 202212-09 39 43
39 ID2 B2 202212-10 40 44
Now, the real table consists of more than a billion rows, so I want to take a random sample consisting of a sample of product for the last day of entry but it needs to be from ALL stores.
I tried the following approach:
Since I want the last date of entry I use a with
clause to limit to the last date (max(DATE(product_timestamp))
) and list all the stores with another with
clause on stores. I then take the random sample:
query_random_sample = """
with maxdate as (select max(DATE(product_timestamp)) as maxdate from `MyProject.DataSet1.product_timeline`)
,
stores as (select store from `MyProject.DataSet1.stores`)
select t.*,
t2.ProductDescription,
t2.ProductName,
t2.CreatedDate,
from (`MyProject.DataSet1.product_timeline` as t
join `MyProject.DataSet2.LableStore` as t2
on t.store = t2.store
and t.barcode = t2.barcode
join maxdate
on maxdate.maxdate = DATE(t.product_timestamp)
)
join stores
on stores.store = t.store
where rand()< 0.01
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
]
)
sampled_labels = bigquery_client.query(query_random_sample, job_config=job_config).to_dataframe()
The problem is that it even samples on store
, but I want the sample to be on product
for each store.
I work in Python and an alternative would be to do the query for each store, but the cost of such a query would be huge (over 1200 stores).
How can I solve this is a cost efficient way.
CodePudding user response:
If I'm right to assume you want a random sample specific to each store, then I think your best bet is using a window function to do your random selection, using a window partitioned by Store
:
SELECT
Store,
Product,
Date,
property1,
property2,
FROM
`MyProject.DataSet1.product_timeline`
QUALIFY
PERCENT_RANK() OVER(all_stores_rand) < 0.01
WINDOW
all_stores_rand AS (
PARTITION BY Store
ORDER BY RAND()
)
To explain that, we are partitioning the table into one group per value of Store
(analogous to what we'd do for a GROUP BY
), then calculating PERCENT_RANK
over a set of random numbers separately for each store (generating these numbers using RAND()
).
Since the part of the table corresponding to each Store must then yield a set of values evenly spanning 0 to 1, we can throw this into a QUALIFY
(BigQuery's filter clause for window expressions) in order to just grab 1% of the values for each Store
.