I have a pyarrow.Table that's created from a pandas DataFrame:
df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [2.3, 2.4]})
df.columns = pd.MultiIndex.from_tuples([('a',100),('b',200)], names=('name', 'number'))
df.index = pd.MultiIndex.from_tuples([('a',100),('b',200)], names=('name', 'number'))
table = pa.Table.from_pandas(df)
The original df has thousands of columns and rows, and the values are all float64, so they become double when I convert to a pyarrow Table. How can I change them all to float32?
I tried the following:
schema = pa.schema([pa.field("('a',100)", pa.float32()),
                    pa.field("('b',200)", pa.float32())])
table = pa.Table.from_pandas(df, schema=schema)
but that complains that the schema and the DataFrame don't match:
KeyError: "name '('a',100)' present in the specified schema is not found in the columns or index"
CodePudding user response:
First convert the DataFrame to a table, then build a new schema in which every float64 field becomes float32, and cast:
table = pa.Table.from_pandas(df)
schema = pa.schema(
    [
        pa.field(f.name, pa.float32() if f.type == pa.float64() else f.type)
        for f in table.schema
    ]
)
# cast returns a new table rather than modifying in place, so reassign
table = table.cast(schema)
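A quick sanity check on the result (a minimal sketch, run against the example table above):
# After the cast, no field should still be float64/double
assert not any(f.type == pa.float64() for f in table.schema)
print(table.schema)  # float32 prints as "float", just as float64 prints as "double"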
CodePudding user response:
You can cast the table to the types you need:
table = pa.Table.from_pandas(df)
table = table.cast(pa.schema([("('a', '100')", pa.float32()),
                              ("('b', '200')", pa.float32()),
                              ("name", pa.string()),
                              ("number", pa.string())]))
I doubt you will find a way to provide a working schema to Table.from_pandas when using a pandas MultiIndex. The name of a column in that case is a tuple (('a', 100)), but in an Arrow schema column names can only be strings, so you will never be able to create a schema that points to the same column names the DataFrame has.
That's why casting afterwards works: once you have made an Arrow table (and thus all column names have become strings), you can finally pass the string equal to the column name to the cast function.
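If you want to see the flattened names before writing the cast schema, you can print them (a minimal sketch; the exact string rendering of the tuples may differ between pyarrow versions, so verify rather than hard-coding it):
table = pa.Table.from_pandas(df)
# Column names are now plain strings: the stringified MultiIndex tuples plus
# the two index levels that from_pandas preserved as columns.
print(table.schema.names)
# e.g. ["('a', '100')", "('b', '200')", 'name', 'number']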