I have a column in a Dataset[Row] (Scala Spark) which is a list of structs with the fields id (String) and score (Double). I need to convert the list of structs to a raw string so that I can print it without the bracket symbols that are automatically added around each struct and around the list. For example, when I print the column now it looks like this:
[[id1, 0.4], [id2, 0.2], [id3, 0.2], [id4, 0.2]]
but I need to remove the brackets on either end of the list and replace the comma delimiters between structs with : (or any delimiter that is not a comma), like this (while maintaining order):
id1, 0.4: id2, 0.2: id3, 0.2: id4, 0.2
I tried to use the concat_ws method, however it only accepts array<string> or string. Is it possible to convert my list of structs to one long string?
CodePudding user response:
Here is one solution. Starting dataset:
df.show(false)
// +------------------------------------------------+
// |foo                                             |
// +------------------------------------------------+
// |[{id1, 0.4}, {id2, 0.2}, {id3, 0.2}, {id4, 0.2}]|
// +------------------------------------------------+
df.printSchema
// root
// |-- foo: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- id: string (nullable = true)
// | | |-- score: double (nullable = false)
To arrive at the desired string representation, first turn each element of the array into a string using transform with concat_ws, then combine the resulting array of strings into a single string using array_join:
import org.apache.spark.sql.functions._

df.select(
  array_join(
    transform(col("foo"), c => concat_ws(", ", c.getField("id"), c.getField("score"))),
    ": "
  ) as "foo_str"
).show(false)
// +--------------------------------------+
// |foo_str                               |
// +--------------------------------------+
// |id1, 0.4: id2, 0.2: id3, 0.2: id4, 0.2|
// +--------------------------------------+
transform(c, f: (Column) => Column) runs over the elements of the array column c and applies f to each element. In this case, f calls concat_ws, and since the input to f is a column of structs, we use getField("x") to select the value of field x. The result is an array of strings that can then be concatenated into a single string using array_join.
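As a side note, the transform overload used above, which takes a Scala lambda (Column => Column), is only available in org.apache.spark.sql.functions from Spark 3.0 onwards. On Spark 2.4 a roughly equivalent sketch (untested, shown only as an illustration) expresses the same logic as a SQL higher-order function through expr:
import org.apache.spark.sql.functions.expr

df.select(
  // per struct, build "id, score", then join the array elements with ": "
  expr("array_join(transform(foo, x -> concat_ws(', ', x.id, cast(x.score as string))), ': ')") as "foo_str"
).show(false)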
CodePudding user response:
Try this one:
import org.apache.spark.sql._
import org.apache.spark.sql.types.{ArrayType, DoubleType, StringType, StructType}

case class ScoreObj(id: String, score: Double)
case class Record(value: String, scores: List[ScoreObj])

object App {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._

    // create a dataframe with test data
    val data = Seq(
      Row("aaa", List(Row("id1", 0.4), Row("id2", 0.5)))
    )
    val schema = new StructType()
      .add("value", StringType)
      .add("scores", ArrayType(new StructType()
        .add("id", StringType)
        .add("score", DoubleType)))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    df.show(false)
    // +-----+------------------------+
    // |value|scores                  |
    // +-----+------------------------+
    // |aaa  |[[id1, 0.4], [id2, 0.5]]|
    // +-----+------------------------+

    // transform the array column into a string
    df.as[Record].map { case Record(value, scores) =>
      (value, scores.map { case ScoreObj(id, score) => s"$id, $score" }.mkString(": "))
    }.toDF("value", "scores_str").show()
    // +-----+------------------+
    // |value|        scores_str|
    // +-----+------------------+
    // |  aaa|id1, 0.4: id2, 0.5|
    // +-----+------------------+
  }
}
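If the end goal is simply to print the raw strings without any surrounding brackets, one possible follow-up (a sketch only; it belongs inside main above, where spark.implicits._ is in scope, and the value name result is assumed) is to collect the formatted column and print each value directly:
// `result` is an assumed name for the dataframe built above with toDF("value", "scores_str")
val result = df.as[Record].map { case Record(value, scores) =>
  (value, scores.map { case ScoreObj(id, score) => s"$id, $score" }.mkString(": "))
}.toDF("value", "scores_str")

// print each formatted string as-is, without show()'s table framing
result.select("scores_str").as[String].collect().foreach(println)
// id1, 0.4: id2, 0.5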