I have a column in a Dataset[Row] (Scala Spark) which is a list of structs with the fields id (String) and score (Double). I need to convert the list of structs to a raw string so that I can print it without the bracket symbols that are automatically added around each struct and around the list. For example, when I print the column now it looks like this:
[[id1, 0.4], [id2, 0.2], [id3, 0.2], [id4, 0.2]]
but I need to remove the brackets on either end of the list and replace the comma delimiters between structs with : (or any delimiter that is not a comma), like this (while maintaining order):
id1, 0.4: id2, 0.2: id3, 0.2: id4, 0.2
I tried to use the concat_ws method, however it only accepts array<string> or string. Is it possible to convert my list of structs to one long string?
CodePudding user response:
Here is one solution. Starting dataset:
df.show(false)
// +------------------------------------------------+
// |foo                                             |
// +------------------------------------------------+
// |[{id1, 0.4}, {id2, 0.2}, {id3, 0.2}, {id4, 0.2}]|
// +------------------------------------------------+
df.printSchema
// root
// |-- foo: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- id: string (nullable = true)
// | | |-- score: double (nullable = false)
To arrive at the desired string representation, first turn each element of the array into a string using transform with concat_ws, then combine the resulting array of strings into a single string using array_join:
import org.apache.spark.sql.functions._

df.select(
  array_join(
    transform(col("foo"), c => concat_ws(", ", c.getField("id"), c.getField("score"))),
    ": "
  ) as "foo_str"
).show(false)
// +--------------------------------------+
// |foo_str                               |
// +--------------------------------------+
// |id1, 0.4: id2, 0.2: id3, 0.2: id4, 0.2|
// +--------------------------------------+
transform(c, f: (Column) => Column) runs over the elements of the array column c and applies f to each element. In this case, f calls concat_ws, and since the input to f is a column of structs, we use getField("x") to select the value of field x. The result is an array of strings that can then be concatenated into a single string using array_join.
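As a side note, the transform overload used above, which takes a Scala lambda (Column => Column), is only available in org.apache.spark.sql.functions from Spark 3.0 onwards. On Spark 2.4 a roughly equivalent sketch (untested, shown only as an illustration) expresses the same logic as a SQL higher-order function through expr:
import org.apache.spark.sql.functions.expr

df.select(
  // per struct, build "id, score", then join the array elements with ": "
  expr("array_join(transform(foo, x -> concat_ws(', ', x.id, cast(x.score as string))), ': ')") as "foo_str"
).show(false)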
CodePudding user response:
Try this one:
import org.apache.spark.sql._
import org.apache.spark.sql.types.{ArrayType, DoubleType, StringType, StructType}

case class ScoreObj(id: String, score: Double)
case class Record(value: String, scores: List[ScoreObj])

object App {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
    import spark.implicits._

    // create a dataframe with test data
    val data = Seq(
      Row("aaa", List(Row("id1", 0.4), Row("id2", 0.5)))
    )
    val schema = new StructType()
      .add("value", StringType)
      .add("scores", ArrayType(new StructType()
        .add("id", StringType)
        .add("score", DoubleType)))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    df.show(false)
    // +-----+------------------------+
    // |value|scores                  |
    // +-----+------------------------+
    // |aaa  |[[id1, 0.4], [id2, 0.5]]|
    // +-----+------------------------+

    // transform the array column into a string
    df.as[Record].map { case Record(value, scores) =>
      (value, scores.map { case ScoreObj(id, score) => s"$id, $score" }.mkString(": "))
    }.toDF("value", "scores_str").show()
    // +-----+------------------+
    // |value|        scores_str|
    // +-----+------------------+
    // |  aaa|id1, 0.4: id2, 0.5|
    // +-----+------------------+
  }
}
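If the end goal is simply to print the raw strings without any surrounding brackets, one possible follow-up (a sketch only; it belongs inside main above, where spark.implicits._ is in scope, and the value name result is assumed) is to collect the formatted column and print each value directly:
// `result` is an assumed name for the dataframe built above with toDF("value", "scores_str")
val result = df.as[Record].map { case Record(value, scores) =>
  (value, scores.map { case ScoreObj(id, score) => s"$id, $score" }.mkString(": "))
}.toDF("value", "scores_str")

// print each formatted string as-is, without show()'s table framing
result.select("scores_str").as[String].collect().foreach(println)
// id1, 0.4: id2, 0.5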