This is my test JSON file:
{
"item" : {
"fracData" : [ ],
"fractimeData" : [ {
"number" : "1232323232",
"timePeriods" : [ {
"validFrom" : "2021-08-03"
} ]
} ],
"Module" : [ ]
}
}
This is how I read the JSON file:
starhist_test_df = spark.read.json("/mapr/xxx/yyy/ttt/dev/rawdata/Test.json", multiLine=True)
starhist_test_df.createOrReplaceTempView("v_test_df")
This query works.
df_test_01 = spark.sql("""
select item.fractimeData.number from v_test_df""")
df_test_01.collect()
Result
[Row(number=['1232323232'])]
But this query doesn't work.
df_test_01 = spark.sql("""
select item.fractimeData.timePeriods.validFrom from v_test_df""")
df_test_01.collect()
Error
cannot resolve 'v_test_df.`item`.`fractimeData`.`timePeriods`['validFrom']' due to data type mismatch: argument 2 requires integral type, however, ''validFrom'' is of string type.; line 3 pos 0;
What do I have to change, to read the validFrom field?
CodePudding user response:
Try the built-in function transform; it will resolve your query.
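The hint refers to Spark SQL's higher-order function transform, which maps a lambda over each element of an array. A possible query is sketched below; it is an assumption built from the schema in the question, not output from the original session, so it would need to be verified against the real data:

```python
# Hypothetical query using transform + flatten (assumes the v_test_df
# view registered in the question; not executed here).
# transform(item.fractimeData, x -> x.timePeriods.validFrom) produces
# an array<array<string>>; flatten collapses it to array<string>.
query = """
    select flatten(transform(item.fractimeData,
                             x -> x.timePeriods.validFrom)) as validFrom
    from v_test_df
"""
# With a live SparkSession it would be submitted as:
#   spark.sql(query).collect()
```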
CodePudding user response:
Dot notation to access values works with struct or array<struct> types.

The schema of the field number in item.fractimeData is string, and accessing it via dot notation returns an array<string>, since fractimeData is an array.

Similarly, the schema of the field timePeriods in item.fractimeData is array<struct<validFrom>>, and accessing it via dot notation wraps it in another array, resulting in a final schema of array<array<struct<validFrom>>>.

The error you get is because dot notation works on array<struct> but not on array<array>.

Hence, flatten the result of item.fractimeData.timePeriods to get back an array<struct<validFrom>>, and then apply the dot notation.
df_test_01 = spark.sql("""
select flatten(item.fractimeData.timePeriods).validFrom as validFrom from v_test_df""")
df_test_01.collect()
"""
[Row(validFrom=['2021-08-03', '2021-08-03'])]
"""