Given some rows coming from a SQL data source with a schema like...
| A | B | C | D | E | F |
... I'd like to transform them into:
{
  A: {
    invented: { B, C },
    D,
    E,
    F
  }
}
AFAIK, dataFrame.withColumn
won't let me implement such a transformation (it doesn't support nesting a struct inside a first-level struct).
Is my goal even possible?
CodePudding user response:
I think the following code should work (if I understood your question correctly):
import org.apache.spark.sql.functions.{col, struct}

// collect A plus a nested struct of the remaining columns into a single struct column
df.withColumn("nested_struct", struct(
  col("A"),
  struct(
    col("B"),
    struct(
      col("C"),
      struct(col("E"), col("F"))
    ),
    col("D")
  )
))
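For anyone who wants to sanity-check the shape this produces, here is a minimal, self-contained Scala sketch (the toy data and the df variable are just placeholders standing in for the schema in the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy rows with the flat | A | B | C | D | E | F | schema from the question
val df = Seq((1, 2, 3, 4, 5, 6)).toDF("A", "B", "C", "D", "E", "F")

val result = df.withColumn("nested_struct", struct(
  col("A"),
  struct(
    col("B"),
    struct(col("C"), struct(col("E"), col("F"))),
    col("D")
  )
))

// printSchema shows the nested layout; unnamed inner structs get
// auto-generated field names like col1/col2
result.printSchema()

Those auto-generated names are the same col1 that the follow-up answer below ends up aliasing.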
CodePudding user response:
First of all, thanks to @partlov for his answer. Actually, when I first posted my question, I forgot to mention that one of the nested structs had to use an invented name, i.e. a name that doesn't exist as a column in the source schema.
That said, the issue was very easy to resolve.
My first attempt was:
dataFrame.WithColumn("invented",
    Struct(
        Struct("invented2", "A")
    ))
But this threw an exception: Spark complained that it could not resolve 'invented2', because invented2 isn't a column in the schema.
Then I realized I could simply not provide the invented name at all. That worked, but Spark named the nested struct col1. Finally, I aliased col1, and that solved the issue!
dataFrame.WithColumn("invented",
    Struct(
        // no made-up column name inside Struct(); the alias gives the nested struct its name
        Struct("A").As("X")
    ));
Note: the samples above are C# code, since I'm using .NET for Spark. Anyway, it should work the same way in Scala, Python, Java, R...
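For readers following along in Scala, here is a rough equivalent of that fix (a sketch only; the column A and the names "invented" / "X" are just the ones used in the samples above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3).toDF("A") // toy data: only column A matters here

// Build the inner struct from the real column only, then alias it:
// the alias ("X") is what introduces a field name that has no
// counterpart in the source schema.
val result = df.withColumn("invented",
  struct(
    struct(col("A")).as("X")
  ))

// The nested field shows up as "X" instead of the auto-generated col1
result.printSchema()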