Does Spark .load() all data into DF and then performing .select("fields")?


I have read that Spark retrieves only the data it needs, but how can I verify this using Scala? I am loading data from an Elasticsearch (ES) index into a Spark DataFrame, and I need to select only a few fields. If I use this:

val indexData = sparkSession.read
    .format("es")
    .option("scroll.limit", 100000)
    .load(index)
    .select("country")

Will Spark load all fields of the records and then select "country", or will it select "country" first and only then load the data?

CodePudding user response:

You can check whether column pruning (also called projection pushdown: only the selected columns are loaded from the source) works by inspecting the physical query plan.

For example, take this simple snippet and run it on your local machine:

import org.apache.spark.sql._

object App {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a small DataFrame to disk, then read it back selecting a single column.
    Seq((1, "aa"), (2, "bb"), (3, "cc")).toDF("id", "value").write.mode("overwrite").parquet("tmp_data")
    val df = spark.read.parquet("tmp_data").select("id")

    // Print the physical plan to see which columns are actually read from the source.
    df.explain()

    spark.stop()
  }
}

The output should be something similar to:

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [id#13] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/gabriel/IdeaProjects/SparkTests/tmp_data], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>

The ReadSchema: struct<id:int> entry shows that only the data of the id column was loaded from the source; the value column was pruned away.
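
You can run the same check against your Elasticsearch read by calling explain on the DataFrame itself. This is a minimal sketch, assuming the elasticsearch-spark connector is on the classpath and index names an existing index; the exact plan text varies by connector version, but the idea is the same: look at which fields the scan node reports reading.

val indexData = sparkSession.read
  .format("es")
  .option("scroll.limit", 100000)
  .load(index)
  .select("country")

// Print the physical plan; if the connector prunes columns,
// the scan node should report only the country field.
indexData.explain()

If the connector does not prune the columns for you, you can also restrict the fields explicitly with the es.read.field.include setting from elasticsearch-hadoop, e.g. .option("es.read.field.include", "country").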
