I have a DataFrame with multiple columns, e.g.
root
|-- playerName
|-- country
|-- bowlingAvg
|-- bowlingSR
|-- wickets
|-- battingAvg
|-- battingSR
|-- runs
I also have a list of the column names that correspond to the bowling stats:
List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingAvg", "bowlingSR", "wickets"));
Expected Schema:
root
|-- playerName
|-- country
|-- bowlingAvg
|-- bowlingSR
|-- wickets
|-- battingAvg
|-- battingSR
|-- runs
|-- bowlingStats
|    |-- bowlingAvg
|    |-- bowlingSR
|    |-- wickets
I can do it like this:
playerDF = playerDF.withColumn("bowlingStats", functions.struct("bowlingAvg", "bowlingSR", "wickets"));
However, I want to use the list to dynamically select the columns for the struct.
I know we can do it like this in Scala:
playerDF = playerDF.select(struct(bowlingParams.map(col): _*))
and I have also found a reference for how to do this in Python.
Is there a way to do this in Java with Spark?
CodePudding user response:
For Java, this solution worked for me:
Remove one attribute from the list (the non-dynamic one).
Convert the remaining list to a Scala Seq using JavaConverters.
When creating the nested column, pass that one attribute (as a String) and your converted Scala Seq to struct.
import scala.collection.JavaConverters;

// Keep only the remaining attributes; "bowlingAvg" is passed separately below.
List<String> bowlingParams = new ArrayList<>(Arrays.asList("bowlingSR", "wickets"));

// The Scala varargs of struct(String, String*) show up in Java as a
// scala.collection.Seq parameter, so convert the Java list to a Scala Seq
// and pass it as the second argument.
playerDF = playerDF.withColumn("bowlingStats",
        functions.struct("bowlingAvg",
                JavaConverters.asScalaIteratorConverter(bowlingParams.iterator()).asScala().toSeq()));
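Alternatively (not from the answer above, just a common pattern): if I remember correctly, functions.struct is annotated with @scala.annotation.varargs, so a Column... overload is callable directly from Java. That lets you build a Column[] from the full list without pulling any attribute out. A minimal sketch, assuming the same playerDF and the original three-element list:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Full dynamic list; nothing has to be removed.
List<String> bowlingParams = Arrays.asList("bowlingAvg", "bowlingSR", "wickets");

// Map each name to a Column and collect into a Column[] for the varargs overload.
Column[] bowlingCols = bowlingParams.stream()
        .map(functions::col)
        .toArray(Column[]::new);

playerDF = playerDF.withColumn("bowlingStats", functions.struct(bowlingCols));

This keeps everything on the Java side, with no JavaConverters needed.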