Here is the schema of the input data that I read from the file myIntialFile.parquet with Spark/Scala:
val df = spark.read.format("parquet").load("/usr/sample/myIntialFile.parquet")
df.printSchema
root
|-- details: string (nullable = true)
|-- infos: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- text: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- value: string (nullable = true)
Then I just did a .select("infos") and wrote the DataFrame back out as a Parquet file (let's say sparkProcessedFile.parquet). Of course, the data schema of the infos column remained unchanged.
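My actual code is Scala, but the step is roughly equivalent to this PySpark sketch (the output path is just the file name I mentioned above):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the original file, keep only the nested column, and write it back out.
df = spark.read.parquet("/usr/sample/myIntialFile.parquet")
df.select("infos").write.mode("overwrite").parquet("sparkProcessedFile.parquet")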
On the other hand, when I compare the schemas of myIntialFile.parquet and sparkProcessedFile.parquet using pyarrow, I don't get the same schema:
import pyarrow.parquet as pq
initialTable = pq.read_table('myIntialFile.parquet')
initialTable.schema
infos: list<array: struct<text: string, id: string, value: string> not null>
sparkProcessedTable = pq.read_table('sparkProcessedFile.parquet')
sparkProcessedTable.schema
infos: list<element: struct<text: string, id: string, value: string>>
I don't understand why there is a difference (list<array: struct<...> not null> vs. list<element: struct<...>>) and why Spark changed the initial nested structure with a simple select.
Thanks for any suggestions.
CodePudding user response:
The actual data type didn't change. In both cases infos is a variable-sized list of structs. In other words, each value in the infos column is a list of structs.
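You can verify this with pyarrow by comparing the value types of the two list columns directly; a minimal sketch, assuming both files are readable from the working directory:
import pyarrow.parquet as pq

initial_infos = pq.read_table('myIntialFile.parquet').schema.field('infos').type
processed_infos = pq.read_table('sparkProcessedFile.parquet').schema.field('infos').type

# The struct carried by each list should compare equal; only the list
# element's name ('array' vs 'element') and its nullability differ.
print(initial_infos.value_type.equals(processed_infos.value_type))  # expected: True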
Arguably, there isn't much point to the names array and element. I think different Parquet readers/writers basically just make something up here. Note that pyarrow will call the field item when creating a new array from memory:
>>> import pyarrow as pa
>>> pa.list_(pa.struct([pa.field('text', pa.string()), pa.field('id', pa.string()), pa.field('value', pa.string())]))
ListType(list<item: struct<text: string, id: string, value: string>>)
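If you need a specific element name when building the type yourself, pa.list_ also accepts a Field, so you can reproduce Spark's naming (continuing the session above):
>>> pa.list_(pa.field('element', pa.struct([pa.field('text', pa.string()), pa.field('id', pa.string()), pa.field('value', pa.string())])))
ListType(list<element: struct<text: string, id: string, value: string>>)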
It appears that Spark normalizes the "list element name" to element (or perhaps to whatever is in its own schema), regardless of what is actually in the Parquet file. This seems like a reasonable thing to do (although one could imagine it causing a compatibility issue).
Perhaps more concerning is the fact that the list element changed from "not null" to nullable. Again, Spark reports the field as nullable (containsNull = true), so either Spark has decided that all array elements are nullable, or it decided in some other way that the schema required the element to be nullable.