I wanted to change the order of the columns in a dataframe and found out that the correct way to do it was the following:
val reorderedColumnNames: Array[String] = ??? // the new order of columns
val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
And my question is:
Why it can´t be done like the following if the colon ": _*" is supposed to get you all the elements of a collection (Seq, List or Array)?
val result: DataFrame = dataFrame.select(reorderedColumnNames: _*)
I don't understand why the second way isn't accepted by Spark (and produces an error) because both ways are supposed to be equivalent
CodePudding user response:
The API of the Dataset
object requires it to be this way.
First, Dataframe
is (since 2.0 as I recall, more or less) just a Dataset[Row]
in terms of implementation, so you have to check the API doc of Dataset.
Looking at the select
method, there are two implementations to chose from.
def select(col: String, cols: String*): DataFrame
This one accepts a single String
, followed by any number of String
.
def select(cols: Column*): DataFrame
This one accepts any number of Column
instances.
Your sample code : val reorderedColumnNames: Array[String]
is an array of string. As such, you can only use the string variant of the method, which requires the use of at least one String
argument, followed by possibly others.
That is why you can not expand you array directly.
If you were to convert your Array[String]
to an Array[Column]
, it would work, that is :
val reorderedColumnNamesAsCols: Array[Column] = reorderedColumnNames.map(col(_))
dataFrame.select(reorderedColumnNamesAsCols: _*)
Now why not also propose a select(cols: String*)
variant also ? I have no knowledge of if / why this as been proposed but I guess that it would not play nicely by the compiler.
We can check :
scala> class Test {
| def test(strs: String*): Unit = {}
| def test(nbrs: Integer*): Unit = {}
| }
<console>:13: error: double definition:
def test(strs: String*): Unit at line 12 and
def test(nbrs: Integer*): Unit at line 13
have same type after erasure: (strs: Seq)Unit
def test(nbrs: Integer*): Unit = {}
^
Compiler not happy.