how to change pyarrow table column precision for multi level index/column DataFrames


I have a pyarrow.Table that's created from a pandas DataFrame:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [2.3, 2.4]})
    df.columns = pd.MultiIndex.from_tuples([('a', 100), ('b', 200)], names=('name', 'number'))
    df.index = pd.MultiIndex.from_tuples([('a', 100), ('b', 200)], names=('name', 'number'))

    table = pa.Table.from_pandas(df)

The real df has thousands of rows and columns, all of them float64, so every column becomes double when I convert to a pyarrow Table.

How can I change them all to float32?

I tried the following:

    schema = pa.schema([pa.field("('a',100)", pa.float32()),
                        pa.field("('b',200)", pa.float32())])
    table = pa.Table.from_pandas(df, schema=schema)

but that complains that the schema and the DataFrame don't match: `KeyError: "name '('a',100)' present in the specified schema is not found in the columns or index"`

CodePudding user response:

First convert the DataFrame to a table, then rebuild the schema so that every float64 field becomes float32:

    table = pa.Table.from_pandas(df)
    schema = pa.schema(
        [
            pa.field(f.name, pa.float32() if f.type == pa.float64() else f.type)
            for f in table.schema
        ]
    )
    # cast returns a new table; reassign it
    table = table.cast(schema)

CodePudding user response:

You can cast the table to the types you need:

    table = pa.Table.from_pandas(df)
    table = table.cast(pa.schema([("('a', '100')", pa.float32()),
                                  ("('b', '200')", pa.float32()),
                                  ("name", pa.string()),
                                  ("number", pa.string())]))

I doubt you will find a way to provide a working schema to Table.from_pandas when using a pandas MultiIndex. The column name in that case is a tuple (('a', 100)), but Arrow schema field names can only be strings. So you will never be able to create a schema whose names match the column names the DataFrame has.

That's why casting afterwards works: once you have an Arrow table (and thus all column names have become strings), you can finally pass the string form of each column name to the cast function.
