I have a PySpark dataframe with a column containing an array of tuples, like below.
+-----------------------------------------------------------------------------------+
| targeting_values                                                                  |
+-----------------------------------------------------------------------------------+
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)]  |
+-----------------------------------------------------------------------------------+
I want to split it into 4 different columns, with one tuple in each column, like below.
+-----------------------+-----------------------+-----------------------+-----------+
| value1                | value2                | value3                | value4    |
+-----------------------+-----------------------+-----------------------+-----------+
| ('123', '123', '123') | ('abc', 'def', 'ghi') | ('jkl', 'mno', 'pqr') | (0, 1, 2) |
+-----------------------+-----------------------+-----------------------+-----------+
I was trying to achieve this using split(), but had no luck, and I could not find another way to solve it.
So is there a good way to do this?
CodePudding user response:
You can do it by exploding the array and then pivoting it:
// first create the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val arrayStructData = Seq(
  Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
  Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)

val arrayStructSchema = new StructType()
  .add("targeting_values", ArrayType(new StructType()
    .add("_1", StringType)
    .add("_2", StringType)
    .add("_3", StringType)))

val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)
df.show(false)
+--------------------------------------------------------------+
|targeting_values                                              |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+
// Then use a combination of explode, creating an id, and pivoting on it, like this:
df.withColumn("id2", monotonically_increasing_id())
  .select(col("id2"), posexplode(col("targeting_values")))
  .withColumn("id", concat(lit("value"), col("pos") + 1))
  .groupBy("id2").pivot("id").agg(first("col")).drop("id2")
  .show(false)
+---------------+---------------+---------------+---------+
|value1         |value2         |value3         |value4   |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
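Since the question is about PySpark, here is a rough equivalent of the same explode-and-pivot approach in Python; this is a sketch assuming a dataframe df with the same targeting_values column as built above:

from pyspark.sql import functions as F

(df.withColumn("id2", F.monotonically_increasing_id())       # row id to group by after exploding
   .select("id2", F.posexplode("targeting_values"))          # yields pos (index) and col (struct) columns
   .withColumn("id", F.concat(F.lit("value"), (F.col("pos") + 1).cast("string")))
   .groupBy("id2").pivot("id").agg(F.first("col"))           # one column per array position
   .drop("id2")
   .show(truncate=False))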
CodePudding user response:
You can try this, which selects each array element by index:
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])
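Note that this assumes every array has exactly four elements; with Spark's default (non-ANSI) settings, an out-of-range index simply yields null rather than an error. A quick usage sketch, reusing the df built in the first answer:

df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)]).show(truncate=False)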