The result of applying "limit" to spark SQL is not as expected

Time:06-20

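// Assumes a SparkSession named `spark`, as provided by spark-shell.
import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for toDF()

var data = Seq[(String, Int)]()

// Build 9,999 rows: ("value: 1", 1) ... ("value: 9999", 9999)
for (i <- 1 until 10000) {
    val str = f"value: ${i}"
    data = data :+ (str, i)
}

val df = spark.sparkContext.parallelize(data).toDF()
df.createOrReplaceTempView("v_logs")

// query: LIMIT without ORDER BY
val a = spark.sql(
    f"""
    SELECT * FROM v_logs limit 20
    """
)
a.show() // 1
a.show() // 2
a.show() // 3

a.select(col("_2")).show() // 4
a.select(col("_2")).show() // 5
a.select(col("_2")).show() // 6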

This is Spark code written in Scala. I expected the outputs of 1, 2, and 3 to be identical, and likewise the outputs of 4, 5, and 6, but they were not. Of course, adding "order by _2" to the query gives the expected result. I think this is because of the inner workings of Spark, but I'm not sure. Could you please elaborate on this?

CodePudding user response:

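a.select(col("_2")) only projects the column; it does not sort the rows. Without an ORDER BY, "limit 20" returns any 20 rows, and Spark does not guarantee which ones. Because a DataFrame is evaluated lazily, each show() runs the query again, so the 20 rows can differ from one call to the next.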
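
As a minimal sketch of the "order by _2" variant the question mentions, the snippet below rebuilds the same v_logs view and applies the limit after an explicit sort. The object name OrderedLimitSketch, the helper orderedTop20, and the local[4] master are assumptions made for illustration; in spark-shell the existing spark session would be used instead.

```scala
import org.apache.spark.sql.SparkSession

object OrderedLimitSketch {
  // Hypothetical helper: builds the v_logs view from the question and returns
  // the _2 values of the first 20 rows under an explicit ORDER BY.
  def orderedTop20(spark: SparkSession): Seq[Int] = {
    import spark.implicits._
    val data = (1 until 10000).map(i => (s"value: $i", i))
    spark.sparkContext.parallelize(data).toDF().createOrReplaceTempView("v_logs")
    spark.sql("SELECT * FROM v_logs ORDER BY _2 LIMIT 20")
      .select("_2")
      .as[Int]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("ordered-limit-sketch")
      .getOrCreate()
    try {
      val first = orderedTop20(spark)
      val second = orderedTop20(spark)
      // _2 is unique, so the sort is total and both runs return the same rows.
      println(first == second)
    } finally {
      spark.stop()
    }
  }
}
```

Since _2 is unique in this data, the sort leaves no ties for the limit to break arbitrarily. If the sort column had duplicates, adding a second column to the ORDER BY would be needed to get the same guarantee.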

I ran your code and got the expected results: 1, 2, and 3 all print the following:

+---------+---+
|       _1| _2|
+---------+---+
| value: 1|  1|
| value: 2|  2|
| value: 3|  3|
| value: 4|  4|
| value: 5|  5|
| value: 6|  6|
| value: 7|  7|
| value: 8|  8|
| value: 9|  9|
|value: 10| 10|
|value: 11| 11|
|value: 12| 12|
|value: 13| 13|
|value: 14| 14|
|value: 15| 15|
|value: 16| 16|
|value: 17| 17|
|value: 18| 18|
|value: 19| 19|
|value: 20| 20|
+---------+---+
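
Whether the unordered query repeats itself seems to depend on the environment, so one way to check a given setup is to run the query several times and count the distinct result sets. This is only a sketch: the object name LimitRepeatCheck, the helper distinctResults, and the local[4] master are assumptions for illustration. A count of 1 on one machine does not mean the result is guaranteed elsewhere.

```scala
import org.apache.spark.sql.SparkSession

object LimitRepeatCheck {
  // Hypothetical helper: runs the unordered LIMIT query `runs` times and
  // returns how many distinct result sets were seen (1 means every run agreed).
  def distinctResults(spark: SparkSession, runs: Int): Int = {
    import spark.implicits._
    val data = (1 until 10000).map(i => (s"value: $i", i))
    spark.sparkContext.parallelize(data).toDF().createOrReplaceTempView("v_logs")
    val a = spark.sql("SELECT * FROM v_logs LIMIT 20")
    // Each collect() re-executes the query, just like each show() does.
    (1 to runs).map(_ => a.collect().toSeq).distinct.size
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("limit-repeat-check")
      .getOrCreate()
    try {
      println(s"distinct result sets over 5 runs: ${distinctResults(spark, 5)}")
    } finally {
      spark.stop()
    }
  }
}
```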