I have data in the form of a nested Python dictionary that I would like to serialize:
{
    top_value: [
        {
            "probabilities": prob_array,
            "metrics": {
                "metric_a": a_array,
                "metric_b": b_array,
                "metric_c": c_array
            }
        }
    ]
}
where all *_array variables are NumPy arrays.
Since the NumPy arrays are somewhat large (more than 1000 items) and there are many top_value keys and values, I feel that JSON is not suitable for this job and would like to use Apache Feather, even though numpy.savez and h5py could technically handle this. Do PyArrow and Apache Feather actually support this level of nesting?
To serialize to Apache Feather, I need to convert my data to a pyarrow.Table first. How do I do this? Can PyArrow infer this schema automatically from the data? If not, how do I define it? I tried something like:
schema = pa.schema(
    [
        pa.field("lambda_diversity_const", pa.struct(
            [
                pa.field("probabilities", pa.float64()),
                pa.map_(pa.string(), pa.float64())
            ]
        ))
    ]
)
But got the error:
TypeError: Cannot convert pyarrow.lib.MapType to pyarrow.lib.Field
So I must be defining the nesting wrong.
Answer:
Do PyArrow and Apache Feather actually support this level of nesting?
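Yes, they do. Feather V2 is the Arrow IPC file format, so it can store nested Arrow types such as structs, lists, and maps.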
Can PyArrow infer this schema automatically from the data?
In your case it can't reliably. Arrow supports both maps and structs, and inference cannot know which one you intend for a Python dict, so you'll have to provide the schema explicitly.
So I must be defining the nesting wrong.
You are missing a pa.field around the map type. pa.struct expects a list of fields, not bare types:
import pyarrow as pa

schema = pa.schema(
    [
        pa.field("lambda_diversity_const", pa.struct(
            [
                pa.field("probabilities", pa.float64()),
                # The map type must be wrapped in a named field.
                pa.field("metrics", pa.map_(pa.string(), pa.float64()))
            ]
        ))
    ]
)
Also, since your values are arrays rather than scalars, you need to use pa.list_(pa.float64()) instead of pa.float64():
schema = pa.schema(
    [
        pa.field("lambda_diversity_const", pa.struct(
            [
                pa.field("probabilities", pa.list_(pa.float64())),
                pa.field("metrics", pa.map_(pa.string(), pa.list_(pa.float64())))
            ]
        ))
    ]
)
A nested schema like this is not going to be very convenient to work with, though. Arrow is optimised for structured tabular data, so you may want to change the schema to something flat anyway, like:
schema = pa.schema(
    [
        pa.field("batch_id", pa.string()),
        pa.field("record_id", pa.string()),
        pa.field("probability", pa.float64()),
        pa.field("metric_a", pa.float64()),
        pa.field("metric_b", pa.float64()),
        pa.field("metric_c", pa.float64()),
    ]
)