Say we have the following dataframe (which is borrowed from 'PySpark by Examples' website):
simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Finance","CA",90000,24,23000), \
("Raman","Finance","CA",99000,40,24000), \
("Scott","Finance","NY",83000,36,19000), \
("Jen","Finance","NY",79000,53,15000), \
("Jeff","Marketing","CA",80000,25,18000), \
("Kumar","Marketing","NY",91000,50,21000) \
]
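To make the snippet runnable, the DataFrame can be built like this (a minimal sketch: the column names are inferred from the output shown further down, and the app name is arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-example").getOrCreate()

# Column names match the header of the show() output below.
columns = ["employee_name", "department", "state", "salary", "age", "bonus"]
df = spark.createDataFrame(data=simpleData, schema=columns)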
Then, if we run the following two sort (orderBy) commands:
df.sort("department", "state").show(truncate=False)
or
df.sort(col("department"), col("state")).show(truncate=False)
(with col imported from pyspark.sql.functions), we get the same result:
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+
I know that the first one takes the column names as strings and the second takes them as Column objects. But is there a difference between the two when it comes to processing or later use? Is one of them better, or the standard PySpark form? Or are they just aliases?
PS: In addition to the above, one of the reasons I'm asking is that someone told me there is a 'standard' business form for using Spark; for example, that 'alias' is more popular than 'withColumnRenamed' in industry. Of course, this doesn't sound right to me.
CodePudding user response:
To be certain that the two versions do the same thing, we can have a look at the source code of dataframe.py. Here is the signature of the sort method:
def sort(
    self, *cols: Union[str, Column, List[Union[str, Column]]], **kwargs: Any
) -> "DataFrame":
When you follow the various method calls, you end up on this line:
jcols = [_to_java_column(cast("ColumnOrName", c)) for c in cols]
This converts every column argument, whether it is a string or a Column (cf. the method signature), to a Java column. From that point on, only the Java columns are used, regardless of how they were passed in, so the two versions of sort do exactly the same thing with exactly the same code.
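For reference, that conversion helper looks roughly like this (simplified from pyspark/sql/column.py in Spark 3.x; the exact error message and type hints vary between versions):
def _to_java_column(col: "ColumnOrName") -> JavaObject:
    if isinstance(col, Column):
        # A Column already wraps a Java Column object.
        jcol = col._jc
    elif isinstance(col, str):
        # A plain string is resolved into a Java Column by name.
        jcol = _create_column_from_name(col)
    else:
        raise TypeError("Invalid argument, not a string or column: ...")
    return jcol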
CodePudding user response:
If you look at the explain plan, you'll see that both queries generate the same physical plan, so, processing-wise, they are identical.
df_sort1 = df.sort("department", "state")
df_sort2 = df.sort(col("department"), col("state"))
df_sort1.explain()
df_sort2.explain()
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#8]
      +- Scan ExistingRDD[employee_name#0,department#1,state#2,salary#3L,age#4L,bonus#5L]

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(department#1 ASC NULLS FIRST, state#2 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#18]
      +- Scan ExistingRDD[employee_name#0,department#1,state#2,salary#3L,age#4L,bonus#5L]
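If you are on Spark 3.1 or later, you can also let Spark confirm this directly: DataFrame.sameSemantics returns True when the canonicalized logical plans of two DataFrames are equal.
# Prints True: both DataFrames have the same canonicalized logical plan (Spark 3.1+).
print(df_sort1.sameSemantics(df_sort2))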
Businesses might have coding guidelines that specify what to use. If they exist, follow them. If not, and you're working on existing code, it's usually best to follow what is already there. Otherwise it's mainly preference; I'm not aware of a 'standard business form' of PySpark.
In the case of alias vs withColumnRenamed, there is an argument to be made in favor of alias if you're renaming multiple columns: selecting with alias generates a single projection in the parsed logical plan, whereas multiple withColumnRenamed calls generate one projection each, as the sketch below shows.
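A quick way to see that difference yourself, reusing the example DataFrame from the question (the new column names here are arbitrary): explain(extended=True) also prints the parsed logical plan, so you can count the Project nodes in each version.
from pyspark.sql.functions import col

# One select with aliases -> a single Project node in the parsed plan.
renamed_alias = df.select(
    col("employee_name").alias("name"),
    col("department").alias("dept"),
    "state", "salary", "age", "bonus",
)

# Chained withColumnRenamed calls -> one Project node per call in the
# parsed plan (the optimizer usually collapses them again later).
renamed_wcr = (
    df.withColumnRenamed("employee_name", "name")
      .withColumnRenamed("department", "dept")
)

renamed_alias.explain(extended=True)
renamed_wcr.explain(extended=True)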