Home > Back-end >  Spark: retrieve datatype of nested Struct column
Spark: retrieve datatype of nested Struct column

Time:10-25

I am currently working on a job that loads a nested Json file as a dataframe, performs some transformations on it and then loads it into a delta table. The testdata I work with has a lot of nested columns, but its possible that the json files the job will load in the future dont come with all the columns all the time (or that they have different datatypes). Therefore, I want to first check if a column is there and which datatype it has. Problem: I dont get it to work, because I dont know how to derive the datatype of a column from the nested schema of the dataframe.

Example: How can I get the datatype of ecuId?

enter image description here

my approach so far was:

df.withColumn("datatype", isinstance(col("reportData.ecus.element.ecuId"), (float, int, str, list, dict, tuple)))

or

df.withColumn("datatype", isinstance(jsonDF.reportData.ecus.element.ecuId, (float, int, str, list, dict, tuple)))

For both versions I get the error message: "col should be Column" Even when I try a very basic

df.withColumn("datatype", type(jsonDF.reportData.ecus.element.ecuId))

I get the same error. It appears as if I have a complete misconception of how to work with nested structures? Can you please explain to me how I get the datatype? Thanks a lot in advance!

CodePudding user response:

The reason you got the error col should be Column, is because withColumn expects the second parameter as a Column object, not a pure Python object.

The closest approach I got is a bit "hacky", by parsing the schema of dataframe manually.

(df
    .withColumn('schema', F.lit(df.dtypes[0][1]))
    .withColumn('datatype', F.regexp_extract('schema', 'ecus.*.ecuId:([^>]*)', 1))
    .show(10, False)
)
# Output
#  ---------- ---------------------------------------- -------- 
# |reportData|schema                                  |datatype|
#  ---------- ---------------------------------------- -------- 
# |{[{1000}]}|struct<ecus:array<struct<ecuId:bigint>>>|bigint  |
#  ---------- ---------------------------------------- -------- 
  • Related