Convert nested dictionary of string keys and array values to pyarrow Table

I have data in the form of a nested Python dictionary that I would like to serialize:

{
    top_value: [
        {
            "probabilities": prob_array,
            "metrics": {
                "metric_a": a_array,
                "metric_b": b_array,
                "metric_c": c_array
            }
        }
    ]
}

where all *_array variables are NumPy arrays.

Since the NumPy arrays are somewhat large (more than 1000 items) and there are many top_value keys and values, I feel that JSON is not suitable for this job and would like to use Apache Feather, even though numpy.savez and h5py could technically handle this. Do PyArrow and Apache Feather actually support this level of nesting?

To serialize to Feather, I need to convert my data to a pyarrow.Table first. How do I do this? Can PyArrow infer this schema automatically from the data? If not, how do I define it? I tried something like:

schema = pa.schema(
    [
        pa.field("lambda_diversity_const", pa.struct(
            [
                pa.field("probabilities", pa.float64()),
                pa.map_(pa.string(), pa.float64())
            ]
        ))
    ]
)

But got the error:

TypeError: Cannot convert pyarrow.lib.MapType to pyarrow.lib.Field

So I must be defining the nesting wrong.

Answer:

Do PyArrow and Apache Feather actually support this level of nesting?

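Yes, they do. Arrow supports nested types (structs, lists and maps), and Feather V2 is the Arrow IPC file format, so it can store them.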

Can PyArrow infer this schema automatically from the data?

Not in your case. Arrow supports both maps and structs, and a Python dict could be either one. When left to infer, PyArrow converts a dict to a struct, so if you want a map you'll have to provide the schema explicitly.

So I must be defining the nesting wrong.

You are missing a pa.field around the map type. pa.struct expects a list of fields, and you passed it a bare pa.map_(...):

schema = pa.schema(
    [
        pa.field("lambda_diversity_const", pa.struct(
            [
                pa.field("probabilities", pa.float64()),
                pa.field("metrics", pa.map_(pa.string(), pa.float64()))
            ]
        ))
    ]
)

Also, since your values are arrays rather than single numbers, you need to use pa.list_(pa.float64()):

schema = pa.schema(
    [
        pa.field("lambda_diversity_const", pa.struct(
            [
                pa.field("probabilities", pa.list_(pa.float64())),
                pa.field("metrics", pa.map_(pa.string(), pa.list_(pa.float64())))
            ]
        ))
    ]
)

That schema is valid, but it is not going to be very convenient to work with. Arrow is optimised for structured tabular data, so you may want to flatten the schema anyway, for example:

schema = pa.schema(
    [
        pa.field("batch_id", pa.string()),
        pa.field("record_id", pa.string()),
        pa.field("probability", pa.float64()),
        pa.field("metric_a", pa.float64()),
        pa.field("metric_b", pa.float64()),
        pa.field("metric_c", pa.float64()),
    ]
)