Data format inconsistency during read/write parquet file with spark


Here is the schema of the input data that I read from the file myIntialFile.parquet with Spark/Scala:

val df = spark.read.format("parquet").load("/usr/sample/myIntialFile.parquet")
df.printSchema

root
|-- details: string (nullable = true)
|-- infos: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- text: string (nullable = true)
|    |    |-- id: string (nullable = true)
|    |    |-- value: string (nullable = true)

Then I simply did a .select("infos") and wrote the dataframe out as a Parquet file (let's say as sparkProcessedFile.parquet), as shown below. And of course, the schema of the infos column remained unchanged.
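
For completeness, the processing step was essentially the following (a PySpark sketch equivalent to the Scala above; the output path is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("parquet").load("/usr/sample/myIntialFile.parquet")
# Keep only the nested column and write it back out as Parquet
df.select("infos").write.format("parquet").save("/usr/sample/sparkProcessedFile.parquet")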

On the other hand, when I compare the schemas of myIntialFile.parquet and sparkProcessedFile.parquet using pyarrow, I don't get the same schema:

import pyarrow.parquet as pa

initialTable = pa.read_table('myIntialFile.parquet')
initialTable.schema 

infos: list<array: struct<text: string, id: string, value: string> not null>


sparkProcessedTable = pa.read_table('sparkProcessedFile.parquet')
sparkProcessedTable.schema 

infos: list<element: struct<text: string, id: string, value: string>>

I don't understand why there is a difference (list<array: struct> not null versus list<element: struct>) and why Spark changed the initial nested structure with a simple select.

Thanks for any suggestions.

CodePudding user response:

The actual data type didn't change. In both cases infos is a variable-sized list of structs. In other words, each value in the infos column is a list, and each element of that list is a struct.

Arguably, there isn't much point to the names array or element; they are just labels for the list's element field, and different Parquet readers/writers essentially make something up here. Note that pyarrow will call the field item when creating a new list type in memory:

>>> import pyarrow as pa
>>> pa.list_(pa.struct([pa.field('text', pa.string()), pa.field('id', pa.string()), pa.field('value', pa.string())]))
ListType(list<item: struct<text: string, id: string, value: string>>)

It appears that Spark normalizes the list element name to element (or perhaps to whatever is in its own schema) regardless of what is actually in the Parquet file. Notably, element is also the name the Parquet format's list specification recommends for the inner field. This seems like a reasonable thing to do, although one could imagine it causing compatibility issues.
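
If you want to see the element name actually stored in each file's footer, rather than pyarrow's Arrow-level translation of it, one way is to print the raw Parquet schema (a quick sketch, assuming both files are in the working directory):

import pyarrow.parquet as pq

# Shows the physical group/field names as written in the file footer,
# not the Arrow schema that pyarrow derives from them
print(pq.ParquetFile('myIntialFile.parquet').schema)
print(pq.ParquetFile('sparkProcessedFile.parquet').schema)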

Perhaps more concerning is the fact that the list elements changed from "not null" to nullable. Again, Spark reports the field as nullable, so either Spark has decided that all array elements are nullable, or something else in its schema handling led it to treat the elements as nullable on write.
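
You can check both the element name and its nullability directly from the Arrow schema, for example (again a sketch, with paths as in the question):

import pyarrow.parquet as pq

for path in ['myIntialFile.parquet', 'sparkProcessedFile.parquet']:
    # value_field is the list's element field, carrying its name and nullability
    element = pq.read_schema(path).field('infos').type.value_field
    print(path, '->', element.name, '| nullable:', element.nullable)

Based on the schemas shown above, this should report array / nullable False for the initial file and element / nullable True for the Spark output.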
