Is it possible to get a sample n rows of a query with a where clause?
I tried to use the tablesample function below but I ended up only getting records in the first partition '2021-09-14.' P
select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample (100 rows)
CodePudding user response:
You can utilise Monotonically Increasing ID - here or Rand to generate an additional column which can be used to Order your dataset to generate the necessary sampling field
Both of these functions can be used in conjunction or individually
Further more you can use LIMIT
clause to sample your required N
records
NOTE - orderBy would be a costly operation
Data Preparation
input_str = """
1 2/12/2019 114 2
2 3/5/2019 116 1
3 3/3/2019 120 6
4 3/4/2019 321 10
6 6/5/2019 116 1
7 6/3/2019 116 1
8 10/1/2019 120 3
9 10/1/2019 120 3
10 10/1/2020 120 3
11 10/1/2020 120 3
12 10/1/2020 120 3
13 10/1/2022 120 3
14 10/1/2021 120 3
15 10/6/2019 120 3
""".split()
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id ship_date customer_id quantity".split()))
n = len(input_values)
input_list = [tuple(input_values[i:i 4]) for i in range(0,n,4)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))
sparkDF.show()
----------- ---------- ----------- --------
|shipment_id| ship_date|customer_id|quantity|
----------- ---------- ----------- --------
| 1|2019-12-02| 114| 2|
| 2|2019-05-03| 116| 1|
| 3|2019-03-03| 120| 6|
| 4|2019-04-03| 321| 10|
| 6|2019-05-06| 116| 1|
| 7|2019-03-06| 116| 1|
| 8|2019-01-10| 120| 3|
| 9|2019-01-10| 120| 3|
| 10|2020-01-10| 120| 3|
| 11|2020-01-10| 120| 3|
| 12|2020-01-10| 120| 3|
| 13|2022-01-10| 120| 3|
| 14|2021-01-10| 120| 3|
| 15|2019-06-10| 120| 3|
----------- ---------- ----------- --------
Order By - Monotonically Increasing ID & Rand
sparkDF.createOrReplaceTempView("shipment_table")
sql.sql("""
SELECT
*
FROM (
SELECT
*
,monotonically_increasing_id() as increasing_id
,RAND(10) as random_order
FROM shipment_table
WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
ORDER BY monotonically_increasing_id() DESC ,RAND(10) DESC
LIMIT 5
)
""").show()
----------- ---------- ----------- -------- ------------- -------------------
|shipment_id| ship_date|customer_id|quantity|increasing_id| random_order|
----------- ---------- ----------- -------- ------------- -------------------
| 15|2019-06-10| 120| 3| 8589934593|0.11682250456449328|
| 9|2019-01-10| 120| 3| 8589934592|0.03422639313807285|
| 8|2019-01-10| 120| 3| 6| 0.8078688178371882|
| 7|2019-03-06| 116| 1| 5|0.36664222617947817|
| 6|2019-05-06| 116| 1| 4| 0.2093704977577|
----------- ---------- ----------- -------- ------------- -------------------
CodePudding user response:
If you are using Dataset
there is built-in functionality for this as outlined in the documenation:
sample(withReplacement: Boolean, fraction: Double): Dataset[T]
Returns a new Dataset by sampling a fraction of rows, using a random seed.
withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].
Since
1.6.0
Note
This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.
To use this you'd filter your dataset against whatever criteria you're looking for, then sample the result. If you need an exact number of rows rather than a fraction you can follow the call to sample
with limit(n)
where n
is the number of rows to return.