Reason behind using x.head, x.tail: _* in Spark

Time:10-25

I wanted to change the order of the columns in a DataFrame and found out that the correct way to do it was the following:

val reorderedColumnNames: Array[String] = ??? // the new order of columns

val result: DataFrame = dataFrame.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)

And my question is: why can't it be done like the following, if ": _*" is supposed to expand all the elements of a collection (Seq, List or Array)?

val result: DataFrame = dataFrame.select(reorderedColumnNames: _*)

I don't understand why the second way isn't accepted by Spark (it produces an error), because both ways are supposed to be equivalent.

CodePudding user response:

The API of the Dataset object requires it to be this way.

First, a DataFrame is (since 2.0, as far as I recall) just a Dataset[Row] in terms of implementation, so you have to check the API doc of Dataset.

Looking at the select method, there are two implementations to choose from.

def select(col: String, cols: String*): DataFrame 

This one accepts a single String, followed by any number of String.

def select(cols: Column*): DataFrame 

This one accepts any number of Column instances.

Your sample code, val reorderedColumnNames: Array[String], is an array of strings. As such, you can only use the String variant of the method, which requires at least one String argument, possibly followed by others. The ": _*" expansion can only fill the repeated parameter cols, not the leading col parameter. That is why you cannot expand your array directly.
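The overload behaviour described above can be reproduced without Spark. The following is only a sketch: FakeFrame and its Column case class are hypothetical stand-ins that copy the shape of the two select signatures, not Spark's actual classes.

```scala
// Hypothetical stand-in for the two Dataset.select overloads; this is not Spark code.
final case class Column(name: String)

object FakeFrame {
  // Same shape as select(col: String, cols: String*): one mandatory name, then varargs.
  def select(col: String, cols: String*): Seq[String] = col +: cols

  // Same shape as select(cols: Column*): varargs only.
  def select(cols: Column*): Seq[String] = cols.map(_.name)
}

object Demo {
  val names: Array[String] = Array("b", "a", "c")

  // Compiles: head fills `col`, the expanded tail fills `cols`.
  val fromStrings: Seq[String] = FakeFrame.select(names.head, names.tail: _*)

  // Compiles: an Array[Column] can be expanded straight into the Column* overload.
  val fromColumns: Seq[String] = FakeFrame.select(names.map(Column(_)): _*)

  // Does not compile: `names: _*` can only fill a repeated parameter, and the
  // String overload has a plain `col: String` in first position.
  // FakeFrame.select(names: _*)
}
```

Both fromStrings and fromColumns hold the names in the requested order. Note that names.head throws on an empty array, so the head/tail form assumes at least one column name.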

If you were to convert your Array[String] to an Array[Column], it would work, that is:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val reorderedColumnNamesAsCols: Array[Column] = reorderedColumnNames.map(col(_))
dataFrame.select(reorderedColumnNamesAsCols: _*)

Now, why not also offer a select(cols: String*) variant? I do not know whether this has ever been proposed, but I guess it would not sit well with the compiler. We can check:

scala> class Test {
 |   def test(strs: String*): Unit = {}
 |   def test(nbrs: Integer*): Unit = {}
 | }
<console>:13: error: double definition:
def test(strs: String*): Unit at line 12 and
def test(nbrs: Integer*): Unit at line 13
have same type after erasure: (strs: Seq)Unit
     def test(nbrs: Integer*): Unit = {}
         ^

The compiler is not happy: after type erasure, both overloads end up with the same signature, (Seq)Unit.
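For contrast, here is a sketch of the same experiment with a leading non-repeated String parameter, which is the shape select(col: String, cols: String*) uses. The class name Test2 and the String return values are made up for illustration. The two methods erase to different signatures, (String, Seq) and (Seq), so the overload should be accepted:

```scala
// Hypothetical variation on the REPL experiment; names are illustrative only.
class Test2 {
  // Erases to (String, Seq): the leading String keeps this overload distinct.
  def test(str: String, strs: String*): String =
    "strings: " + (str +: strs).mkString(",")

  // Erases to (Seq).
  def test(nbrs: Integer*): String =
    "integers: " + nbrs.mkString(",")
}

object ErasureDemo {
  val t = new Test2

  // Resolves to the String overload.
  val a: String = t.test("x", "y")

  // Resolves to the Integer overload.
  val b: String = t.test(Integer.valueOf(1), Integer.valueOf(2))
}
```

This is presumably why the String variant of select takes its first name separately, although the source does not confirm the Spark authors' motivation.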
