I am using Spark with Java and I have a dataframe like this:
id | array_column
-------------------
12 | [a:123, b:125, c:456]
13 | [a:443, b:225, c:126]
and I want to explode array_column while keeping the same id. However, explode alone is not enough, because I want the dataframe to be:
id | a | b | c
-------------------
12 |123 |125 | 456
13 |443 |225 | 126
Therefore, a plain explode on array_column does not work well in this case.
I would be happy for your help, thank you!
CodePudding user response:
The following approach will work on variable-length lists in array_column. It uses explode to expand the list of string elements in array_column, then splits each string element on : into two columns, col_name and col_val. Finally, a pivot is used with a group-by to transpose the data into the desired format.
The following example uses the PySpark API, but it can easily be translated to the Java/Scala APIs as they are similar. I assumed your dataset is in a dataframe named input_df.
from pyspark.sql import functions as F
output_df = (
    input_df.select("id", F.explode("array_column").alias("acol"))
    .select(
        "id",
        F.split("acol", ":")[0].alias("col_name"),
        F.split("acol", ":")[1].cast("integer").alias("col_val"),
    )
    .groupBy("id")
    .pivot("col_name")
    .max("col_val")
)
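For anyone who wants to trace what each step does without spinning up a Spark session, the same explode → split → pivot logic can be sketched in plain Python (this mirrors the example data from the question; it is only an illustration of the logic, not Spark code):

```python
# Plain-Python illustration of the explode -> split -> pivot steps.
rows = [
    (12, ["a:123", "b:125", "c:456"]),
    (13, ["a:443", "b:225", "c:126"]),
]

# explode: one (id, element) pair per array entry
exploded = [(id_, elem) for id_, arr in rows for elem in arr]

# split on ":" into (id, col_name, col_val)
split_rows = [(id_, e.split(":")[0], int(e.split(":")[1])) for id_, e in exploded]

# groupBy("id").pivot("col_name"): one dict per id, keyed by col_name
pivoted = {}
for id_, name, val in split_rows:
    pivoted.setdefault(id_, {})[name] = val

print(pivoted)
# {12: {'a': 123, 'b': 125, 'c': 456}, 13: {'a': 443, 'b': 225, 'c': 126}}
```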
Let me know if this works for you.
CodePudding user response:
A very similar approach to ggordon's answer, in Java:
import static org.apache.spark.sql.functions.*;
Dataset<Row> df = ...
df.withColumn("array_column", explode(col("array_column")))
.withColumn("array_column", split(col("array_column"), ":"))
.withColumn("key", col("array_column").getItem(0))
.withColumn("value", col("array_column").getItem(1))
.groupBy(col("id"))
.pivot(col("key"))
.agg(first("value")) //1
.show();
Output (for the example data from the question):
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
| 12|123|125|456|
| 13|443|225|126|
+---+---+---+---+
I assume that the combination of id and the key field in the array is unique. That's why the aggregation function used at //1 is first. If this combination is not unique, the aggregation function could be changed to collect_list in order to get an array of all matching values.
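To illustrate the difference between the two aggregations (again as a plain-Python sketch rather than Spark code, and with made-up duplicate data where id 12 has two values for key a), a collect_list-style aggregation keeps every value for a repeated (id, key) pair, while a first-style aggregation keeps only the first one seen:

```python
# Hypothetical rows after explode + split, with a duplicate (12, "a") pair.
split_rows = [(12, "a", 123), (12, "a", 999), (12, "b", 125)]

first_agg = {}  # mirrors .agg(first("value"))
list_agg = {}   # mirrors .agg(collect_list("value"))
for id_, name, val in split_rows:
    # setdefault keeps the first value and ignores later duplicates
    first_agg.setdefault(id_, {}).setdefault(name, val)
    # the list variant appends every value instead
    list_agg.setdefault(id_, {}).setdefault(name, []).append(val)

print(first_agg)  # {12: {'a': 123, 'b': 125}}
print(list_agg)   # {12: {'a': [123, 999], 'b': [125]}}
```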