I'm trying to merge a dataframe that has multiple nested struct columns into one single struct column. Here's a same schema
Before:
|- column_1
|- column_2
|- column_3
|- nested_details_1
|- a
|- b
|- c
|- nested_details_2
|- d
|- e
|- f
Desired after:
|- column_1
|- column_2
|- column_3
|- nested_details
|- a
|- b
|- c
|- d
|- e
|- f
Visually I can see what needs to be done, but is there a function with data frames to merge columns for me? If it makes any difference I am using AWS Glue Dynamic Dataframe but I can convert easily enough to a normal data frame.
CodePudding user response:
pyspark
I managed to merge 2 structs using the pyspark struct
function like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, col, expr
spark = SparkSession.builder.getOrCreate()
input_df = spark.sql("""
SELECT
1 as column_1,
2 as column_2,
3 as column_3,
named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
""")
merged_df = input_df.select(
col("column_1"),
col("column_2"),
col("column_3"),
struct(col("nested_details_1.*"),col("nested_details_2.*")).alias("nested_details")
)
merged_df.printSchema()
merged_df.show()
merged_df.select(expr("typeof(nested_details)")).show(truncate=False)
above prints
root
|-- column_1: integer (nullable = false)
|-- column_2: integer (nullable = false)
|-- column_3: integer (nullable = false)
|-- nested_details: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
| |-- c: integer (nullable = false)
| |-- d: integer (nullable = false)
| |-- e: integer (nullable = false)
| |-- f: integer (nullable = false)
-------- -------- -------- ------------------
|column_1|column_2|column_3| nested_details|
-------- -------- -------- ------------------
| 1| 2| 3|{1, 2, 3, 4, 5, 6}|
-------- -------- -------- ------------------
-------------------------------------------
|typeof(nested_details) |
-------------------------------------------
|struct<a:int,b:int,c:int,d:int,e:int,f:int>|
-------------------------------------------
Spark SQL
same can be achieved with SQL also using SQL struct
function like this:
with input as (
SELECT
1 as column_1,
2 as column_2,
3 as column_3,
named_struct("a", 1, "b", 2, "c", 3) as nested_details_1,
named_struct("d", 4, "e", 5, "f", 6) as nested_details_2
)
SELECT
column_1,
column_2,
column_3,
struct(nested_details_1.*, nested_details_2.*) as nested_details
FROM input