Make array of struct in 2 DataFrames identical (Java Spark)

Time:12-21

I have two DataFrames (Dataset<Row>) with the same columns, but the fields of the struct inside the array column are in a different order.

df1:

root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
|    |-- element: struct (containsNull = true)
|    |    |-- array_id: integer (nullable = false)
|    |    |-- array_value: string (nullable = false)

+----+------------+
|root|array_nested|
+----+------------+
|One |[[1, 1-One]]|
+----+------------+

df2:

root
|-- root: string (nullable = false)
|-- array_nested: array (nullable = false)
|    |-- element: struct (containsNull = true)
|    |    |-- array_value: string (nullable = false)
|    |    |-- array_id: integer (nullable = false)


+----+------------+
|root|array_nested|
+----+------------+
|Two |[[2-Two, 2]]|
+----+------------+

I want to make the schemas the same, but my approach generates an extra layer of array:

List<Column> updatedStructNames = new ArrayList<>();
updatedStructNames.add(col("array_nested.array_id"));
updatedStructNames.add(col("array_nested.array_value"));

Column[] updatedStructNameArray = updatedStructNames.toArray(new Column[0]);
Dataset<Row> df3 = df2.withColumn("array_nested", array(struct(updatedStructNameArray)));

It generates a schema like this:

root
 |-- root: string (nullable = false)
 |-- array_nested: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- array_id: array (nullable = false)
 |    |    |    |-- element: integer (containsNull = true)
 |    |    |-- array_value: array (nullable = false)
 |    |    |    |-- element: string (containsNull = true)

+----+----------------+
|root|array_nested    |
+----+----------------+
|Two |[[[2], [2-Two]]]|
+----+----------------+

How can I achieve the same schema?

Answer:

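When a column is an array of structs, col("array_nested.array_id") returns an array holding that field from every element. In your code, struct(...) is therefore built from two arrays, and array(...) wraps the result once more. You can use the transform function (a SQL higher-order function, available since Spark 2.4) to rebuild each struct element of the array_nested column with the fields in the order you want: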

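// requires: import static org.apache.spark.sql.functions.expr;
Dataset<Row> df3 = df2.withColumn(
    "array_nested",
    // rebuild every element x with array_id first, then array_value
    expr("transform(array_nested, x -> struct(x.array_id as array_id, x.array_value as array_value))")
);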
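
If it helps to see the difference without a Spark session, here is a rough plain-Java model of the two approaches. It is only a sketch: it does not call Spark at all, a struct is modelled as an ordered map and an array as a List, and the class and method names (StructReorderSketch, extractField, arrayOfStructOfArrays, reorderEach) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model (no Spark): struct = ordered field map, array = List.
class StructReorderSketch {

    // Models col("array_nested.array_id"): selecting a field through an
    // array gives one value per element, i.e. an array of that field.
    static List<Object> extractField(List<Map<String, Object>> array, String field) {
        List<Object> values = new ArrayList<>();
        for (Map<String, Object> struct : array) {
            values.add(struct.get(field));
        }
        return values;
    }

    // Models array(struct(...)) from the question: a single struct whose
    // fields are whole arrays, wrapped in a one-element array.
    static List<Map<String, Object>> arrayOfStructOfArrays(
            List<Map<String, Object>> array, String... fields) {
        Map<String, Object> struct = new LinkedHashMap<>();
        for (String field : fields) {
            struct.put(field, extractField(array, field));
        }
        List<Map<String, Object>> result = new ArrayList<>();
        result.add(struct);
        return result;
    }

    // Models transform(array_nested, x -> struct(...)): every element is
    // rebuilt on its own, with its fields in the requested order.
    static List<Map<String, Object>> reorderEach(
            List<Map<String, Object>> array, String... fields) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> struct : array) {
            Map<String, Object> reordered = new LinkedHashMap<>();
            for (String field : fields) {
                reordered.put(field, struct.get(field));
            }
            result.add(reordered);
        }
        return result;
    }

    public static void main(String[] args) {
        // One element shaped like df2: array_value first, then array_id.
        Map<String, Object> element = new LinkedHashMap<>();
        element.put("array_value", "2-Two");
        element.put("array_id", 2);
        List<Map<String, Object>> arrayNested = new ArrayList<>();
        arrayNested.add(element);

        // prints [{array_id=[2], array_value=[2-Two]}]
        System.out.println(arrayOfStructOfArrays(arrayNested, "array_id", "array_value"));
        // prints [{array_id=2, array_value=2-Two}]
        System.out.println(reorderEach(arrayNested, "array_id", "array_value"));
    }
}
```

The first line of output has the same shape as the unwanted [[[2], [2-Two]]] row in the question, and the second has the per-element shape that transform produces.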