I use the following code:
random = [("ABC",xx, 1),
("DEF",yy,1),
("GHI",zz, 0)
]
randomColumns = ["name","id", "male"]
randomDF = spark.createDataFrame(data=random, schema = randomColumns)
test_df = randomDF.select("name", "id")
test_df.filter(f.col("male") == '1').show()
From the above code I expect it to result in an error because for the test_df i dont select the male column from the original dataframe. Surprisingly the above query runs just fine without any error and outputs the following:
--------- -------
|name|id|
--------- -------
| abc| xx|
| def| yy|
--------- -------
I want to understand the logic behind what spark is doing. As per the spark documentation Select returns a new dataframe. Then why is it still able to use the male column from the parent dataframe.
CodePudding user response:
This is caused by the DAG generated by Spark. Some operators (or transformers
) is are lazy-executed, so they pave the way for Spark to optimize the DAG.
In this example, there are two major steps: select
(or project
in SQL's jargon) first, and filter
later. But in fact, when executing, filter
first, and then select
, because it is more efficient.
You can verify this conclusion by explain()
method:
test_df.filter(f.col("flag") == '1').explain()
It will output:
== Physical Plan ==
*(1) Project [dept_name#0, dept_id#1L]
- *(1) Filter (isnotnull(flag#2L) AND (flag#2L = 1))
- *(1) Scan ExistingRDD[dept_name#0,dept_id#1L,flag#2L]
CodePudding user response:
Adding to @chenzhongpu 's answer, please note that if you define a temp view on top of your test_df
, the query will fail:
test_df.createOrReplaceTempView("test_df")
spark.sql("select * from test_df where flag = 1").show()
_Traceback (most recent call last): ...
:
pyspark.sql.utils.AnalysisException: u"cannot resolve '`flag`' given input columns: [test_df.dept, test_df.id]; line 1 pos 24;
'Project [*]
- 'Filter ('flag = 1)
- SubqueryAlias `test_df`
- Project [dept#0, id#2L]
- LogicalRDD [dept#0, flag#1L, id#2L], false
_
...because a select
(=Project
node in the execution plan) is going to precede filter (attempted via where
clause).