I have a PySpark dataframe with a column containing an array of tuples, like below.
+-----------------------------------------------------------------------------------+
| targeting_values                                                                  |
+-----------------------------------------------------------------------------------+
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)]  |
+-----------------------------------------------------------------------------------+
I want to split it into 4 different columns, with one tuple in each column, like below.
+-----------------------+-----------------------+-----------------------+-----------+
| value1                | value2                | value3                | value4    |
+-----------------------+-----------------------+-----------------------+-----------+
| ('123', '123', '123') | ('abc', 'def', 'ghi') | ('jkl', 'mno', 'pqr') | (0, 1, 2) |
+-----------------------+-----------------------+-----------------------+-----------+
I was trying to achieve this using split(), but had no luck, and I could not find another way to solve it.
So is there a good way to do this?
CodePudding user response:
You can do it by exploding the array and then pivoting it:
// first create the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val arrayStructData = Seq(
  Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
  Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)

val arrayStructSchema = new StructType()
  .add("targeting_values", ArrayType(new StructType()
    .add("_1", StringType)
    .add("_2", StringType)
    .add("_3", StringType)))

val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)
df.show(false)
+--------------------------------------------------------------+
|targeting_values                                              |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+
// Then use a combination of explode, creating an id, and pivoting on it, like this:
df.withColumn("id2", monotonically_increasing_id())
  .select(col("id2"), posexplode(col("targeting_values")))
  .withColumn("id", concat(lit("value"), col("pos") + 1))
  .groupBy("id2").pivot("id").agg(first("col")).drop("id2")
  .show(false)
+---------------+---------------+---------------+---------+
|value1         |value2         |value3         |value4   |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
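Since the question is about PySpark, here is a rough equivalent of the same explode-and-pivot approach in Python; this is a sketch assuming a dataframe df with the same targeting_values column as built above:

from pyspark.sql import functions as F

(df.withColumn("id2", F.monotonically_increasing_id())       # row id to group by after exploding
   .select("id2", F.posexplode("targeting_values"))          # yields pos (index) and col (struct) columns
   .withColumn("id", F.concat(F.lit("value"), (F.col("pos") + 1).cast("string")))
   .groupBy("id2").pivot("id").agg(F.first("col"))           # one column per array position
   .drop("id2")
   .show(truncate=False))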
CodePudding user response:
You can try this, which selects each array element by index:
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])
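Note that this assumes every array has exactly four elements; with Spark's default (non-ANSI) settings, an out-of-range index simply yields null rather than an error. A quick usage sketch, reusing the df built in the first answer:

df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)]).show(truncate=False)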