Spark explode array column to columns

Time:09-29

I am using Spark with Java and I have a dataframe like this:

id  | array_column
-------------------
12  | [a:123, b:125, c:456]
13  | [a:443, b:225, c:126]

and I want to explode array_column while keeping the same id, but explode alone isn't enough, because I want the dataframe to become:

id  | a  | b  | c
-------------------
12  |123 |125 | 456 
13  |443 |225 | 126

Therefore, a plain explode on array_column doesn't work well in this case.

I would appreciate your help. Thank you!

CodePudding user response:

The following approach works on variable-length lists in array_column. It uses explode to expand the list of string elements in array_column, then splits each string element on : into two columns, col_name and col_val. Finally, a pivot with a group by transposes the data into the desired format.

The following example uses the PySpark API but can easily be translated to the Java/Scala APIs, as they are similar. I assume your dataset is in a dataframe named input_df.

from pyspark.sql import functions as F

output_df = (
    input_df.select("id", F.explode("array_column").alias("acol"))
            .select(
                "id",
                F.split("acol", ":")[0].alias("col_name"),
                F.split("acol", ":")[1].cast("integer").alias("col_val")
            )
            .groupBy("id")
            .pivot("col_name")
            .max("col_val")
)

Let me know if this works for you.
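To see what each step produces, the same explode → split → pivot pipeline can be sketched in plain Python, with no Spark required. This is an illustration of the logic only, using sample data that mirrors the question's dataframe:

```python
# Plain-Python sketch of the explode -> split -> pivot steps
# (illustration only; the actual job runs in Spark).
rows = [
    (12, ["a:123", "b:125", "c:456"]),
    (13, ["a:443", "b:225", "c:126"]),
]

# explode: one (id, element) pair per array entry
exploded = [(id_, elem) for id_, arr in rows for elem in arr]

# split: break each element on ":" into (id, col_name, col_val)
split_rows = [
    (id_, elem.split(":")[0], int(elem.split(":")[1]))
    for id_, elem in exploded
]

# group by id and pivot col_name into columns
pivoted = {}
for id_, name, val in split_rows:
    pivoted.setdefault(id_, {})[name] = val

print(pivoted)
# {12: {'a': 123, 'b': 125, 'c': 456}, 13: {'a': 443, 'b': 225, 'c': 126}}
```

Each id ends up with one entry per distinct col_name, which is exactly the wide shape the pivot produces in Spark.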

CodePudding user response:

A very similar approach to ggordon's answer, in Java:

import static org.apache.spark.sql.functions.*;

Dataset<Row> df = ...

df.withColumn("array_column", explode(col("array_column")))
        .withColumn("array_column", split(col("array_column"), ":"))
        .withColumn("key", col("array_column").getItem(0))
        .withColumn("value", col("array_column").getItem(1))
        .groupBy(col("id"))
        .pivot(col("key"))
        .agg(first("value")) //1
        .show();

Output:

+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
| 12|123|125|456|
| 13|443|225|126|
+---+---+---+---+

I assume that the combination of id and the key field in the array is unique. That's why the aggregation function used at //1 is first. If that combination is not unique, the aggregation function can be changed to collect_list to get an array of all matching values.
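The collect_list fallback can be sketched the same way in plain Python: when an (id, key) pair repeats, every value is gathered into a list instead of keeping only the first one. The duplicate row below is an invented example to show the effect:

```python
# Plain-Python sketch of the collect_list variant
# (illustration only; invented duplicate (id, key) pair).
split_rows = [
    (12, "a", 123),
    (12, "a", 124),  # duplicate (id, key) pair
    (12, "b", 125),
]

# instead of keeping the first value per (id, key),
# append every value to a list (collect_list behaviour)
collected = {}
for id_, name, val in split_rows:
    collected.setdefault(id_, {}).setdefault(name, []).append(val)

print(collected)
# {12: {'a': [123, 124], 'b': [125]}}
```

With first, the second a value (124) would be silently dropped; collect_list keeps both.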
