Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected by-CodePudding

I am converting pandas dataframe to polars dataframe but pyarrow throws error.

My code:

import polars as pl
import pandas as pd

if __name__ == "__main__":

    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [
            pd.read_excel(
            excelfile, sheet_name=x, header=0)
                    for x in sheetnames
                    ], axis=0)

    df_pl = pl.from_pandas(df)

Error:

File "pyarrow\array.pxi", line 312, in pyarrow.lib.array

File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array

File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status

pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

I tried changing pandas dataframe dtype to str and problem is solved, but i don't want to change dtypes. Is it bug in pyarrow or am I missing something?

CodePudding user response：

I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.

For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
    return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
    pandas_to_pydf(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
    arrow_dict = {
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
    str(col): _pandas_series_to_arrow(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
    return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

There are several lengthy discussions on this issue, such as these:

This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column: https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116

How to approach this depends on your data and its use, so I can't recommend a blanket solution (i.e., fixing your source Excel file, or changing the dtype to str).

CodePudding user response：

My problem is solved by saving pandas dataframe to 'csv' format and then importing 'csv' file in polars.

import os
import polars as pl
import pandas as pd

if __name__ == "__main__":

    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat([pd.read_excel(excelfile, sheet_name=x, header=0) 
                    for x in sheetnames 
                    ], axis=0)
    df.to_csv("temp.csv",index=False)
    df_pl = pl.scan_csv("temp.csv")
    os.remove("temp.csv")