Home > Blockchain >  Spark Dataframe - Merge Nested Columns into one
Spark Dataframe - Merge Nested Columns into one

Time:10-08

I'm trying to merge a dataframe that has multiple nested struct columns into one single struct column. Here's a same schema

Before:

|- column_1
|- column_2
|- column_3
|- nested_details_1
   |- a
   |- b
   |- c
|- nested_details_2
   |- d
   |- e
   |- f

Desired after:

|- column_1
|- column_2
|- column_3
|- nested_details
   |- a
   |- b
   |- c
   |- d
   |- e
   |- f

Visually I can see what needs to be done, but is there a function with data frames to merge columns for me? If it makes any difference I am using AWS Glue Dynamic Dataframe but I can convert easily enough to a normal data frame.

CodePudding user response:

pyspark

I managed to merge 2 structs using the pyspark struct function like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, col, expr

spark = SparkSession.builder.getOrCreate()

input_df = spark.sql("""
    SELECT 
        1 as column_1,
        2 as column_2,
        3 as column_3,
        named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
        named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
""")
merged_df = input_df.select(
    col("column_1"),
    col("column_2"),
    col("column_3"),
    struct(col("nested_details_1.*"),col("nested_details_2.*")).alias("nested_details")
)
merged_df.printSchema()
merged_df.show()
merged_df.select(expr("typeof(nested_details)")).show(truncate=False)

above prints

root
 |-- column_1: integer (nullable = false)
 |-- column_2: integer (nullable = false)
 |-- column_3: integer (nullable = false)
 |-- nested_details: struct (nullable = false)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)
 |    |-- c: integer (nullable = false)
 |    |-- d: integer (nullable = false)
 |    |-- e: integer (nullable = false)
 |    |-- f: integer (nullable = false)

 -------- -------- -------- ------------------ 
|column_1|column_2|column_3|    nested_details|
 -------- -------- -------- ------------------ 
|       1|       2|       3|{1, 2, 3, 4, 5, 6}|
 -------- -------- -------- ------------------ 

 ------------------------------------------- 
|typeof(nested_details)                     |
 ------------------------------------------- 
|struct<a:int,b:int,c:int,d:int,e:int,f:int>|
 ------------------------------------------- 

Spark SQL

same can be achieved with SQL also using SQL struct function like this:

with input as (
SELECT 
    1 as column_1,
    2 as column_2,
    3 as column_3,
    named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
    named_struct("d", 4, "e", 5, "f", 6) as nested_details_2 
)
SELECT 
    column_1,
    column_2, 
    column_3, 
    struct(nested_details_1.*, nested_details_2.*) as nested_details
FROM input
  • Related