Sparksql get sample rows with where clause-CodePudding

Is it possible to get a sample n rows of a query with a where clause?

I tried to use the tablesample function below but I ended up only getting records in the first partition '2021-09-14.' P

select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample (100 rows)

CodePudding user response：

You can utilise Monotonically Increasing ID - here or Rand to generate an additional column which can be used to Order your dataset to generate the necessary sampling field

Both of these functions can be used in conjunction or individually

Further more you can use LIMIT clause to sample your required N records

NOTE - orderBy would be a costly operation

Data Preparation

input_str = """
1   2/12/2019   114 2
2   3/5/2019    116 1
3   3/3/2019    120 6
4   3/4/2019    321 10
6   6/5/2019    116 1
7   6/3/2019    116 1
8   10/1/2019   120 3
9   10/1/2019   120 3
10  10/1/2020   120 3
11  10/1/2020   120 3
12  10/1/2020   120 3
13  10/1/2022   120 3
14  10/1/2021   120 3
15  10/6/2019   120 3
""".split()

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id  ship_date   customer_id quantity".split()))
            
n = len(input_values)

input_list = [tuple(input_values[i:i 4]) for i in range(0,n,4)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))

sparkDF.show()

 ----------- ---------- ----------- -------- 
|shipment_id| ship_date|customer_id|quantity|
 ----------- ---------- ----------- -------- 
|          1|2019-12-02|        114|       2|
|          2|2019-05-03|        116|       1|
|          3|2019-03-03|        120|       6|
|          4|2019-04-03|        321|      10|
|          6|2019-05-06|        116|       1|
|          7|2019-03-06|        116|       1|
|          8|2019-01-10|        120|       3|
|          9|2019-01-10|        120|       3|
|         10|2020-01-10|        120|       3|
|         11|2020-01-10|        120|       3|
|         12|2020-01-10|        120|       3|
|         13|2022-01-10|        120|       3|
|         14|2021-01-10|        120|       3|
|         15|2019-06-10|        120|       3|
 ----------- ---------- ----------- --------

Order By - Monotonically Increasing ID & Rand

sparkDF.createOrReplaceTempView("shipment_table")

sql.sql("""
SELECT
 *
FROM (
    SELECT 
        *
        ,monotonically_increasing_id() as increasing_id
        ,RAND(10) as random_order
    FROM shipment_table
    WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
    ORDER BY monotonically_increasing_id() DESC ,RAND(10) DESC
    LIMIT 5
)
""").show()

 ----------- ---------- ----------- -------- ------------- ------------------- 
|shipment_id| ship_date|customer_id|quantity|increasing_id|       random_order|
 ----------- ---------- ----------- -------- ------------- ------------------- 
|         15|2019-06-10|        120|       3|   8589934593|0.11682250456449328|
|          9|2019-01-10|        120|       3|   8589934592|0.03422639313807285|
|          8|2019-01-10|        120|       3|            6| 0.8078688178371882|
|          7|2019-03-06|        116|       1|            5|0.36664222617947817|
|          6|2019-05-06|        116|       1|            4|    0.2093704977577|
 ----------- ---------- ----------- -------- ------------- -------------------

CodePudding user response：

If you are using Dataset there is built-in functionality for this as outlined in the documenation:

sample(withReplacement: Boolean, fraction: Double): Dataset[T]

Returns a new Dataset by sampling a fraction of rows, using a random seed.

withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].

Since

    1.6.0
Note

    This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.

To use this you'd filter your dataset against whatever criteria you're looking for, then sample the result. If you need an exact number of rows rather than a fraction you can follow the call to sample with limit(n) where n is the number of rows to return.