Parquet file created in Windows cannot be opened in Ubuntu

Time:05-30

I created a Parquet file on my Windows 10 computer using the following lines:

# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0

import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2], y=[3, 4, 5]))
df.to_parquet('some/path/to/my/windows_parquet_file.parquet')

Now I'm creating a pipeline in Azure Pipelines where I want to load that same file by executing a Python script. The OS of the agent executing the Python script is Ubuntu 20.04.4. The contents of that script:

# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0

import pandas as pd

parquet_file_path = 'some/path/to/my/windows_parquet_file.parquet'
df = pd.read_parquet(parquet_file_path)

However, this last line gives me an error:

Traceback (most recent call last):
  File "/home/vsts/work/_temp/ec5ac2c3-4983-41d5-abe4-cd532dafb5af.py", line 4, in <module>
    df = pd.read_parquet(parquet_file_path)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1960, in read_table
    dataset = _ParquetDatasetV2(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1766, in __init__
    [fragment], schema=fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 797, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Does anyone know why this error is raised and how to solve it? I've searched the internet but couldn't find anything pointing to differences in writing/reading Parquet files on different operating systems.

The Python version is 3.9 both on my PC and on the agent VM.

CodePudding user response:

After spending quite a few more hours on this problem, I found the cause of the error and the solution. The issue has nothing to do with different operating systems or package versions.

The file I referred to is tracked with Git LFS. The file in the checkout was therefore no longer a Parquet file but a small text pointer to the real file, which is why pyarrow could not find the Parquet magic bytes. The solution was to make sure that the LFS-tracked files are actually downloaded before trying to read them. In my specific case, using Azure Pipelines, I found the solution here: How to use Git LFS with Azure Repos and Pipelines
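
To tell quickly whether a file in the checkout is a real Parquet file or only a Git LFS pointer, a small check like the one below might help. It is a minimal sketch, not part of pandas or pyarrow: the function name `classify_file` and its return labels are made up for illustration. It relies on two facts: a Git LFS pointer is a text file starting with `version https://git-lfs.github.com/spec/v1`, and a regular (unencrypted) Parquet file starts and ends with the 4 bytes `PAR1`.

```python
import os

# Magic bytes at both the start and the end of an unencrypted Parquet file.
PARQUET_MAGIC = b"PAR1"
# First line of a Git LFS pointer file.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_file(path):
    """Hypothetical helper: return 'git-lfs-pointer', 'parquet', or 'unknown'."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Read just enough bytes to recognise either format.
        head = f.read(len(LFS_POINTER_PREFIX))
        if head.startswith(LFS_POINTER_PREFIX):
            return "git-lfs-pointer"
        # Smallest possible Parquet layout: magic + footer length + magic.
        if size >= 12 and head[:4] == PARQUET_MAGIC:
            f.seek(-4, os.SEEK_END)
            if f.read(4) == PARQUET_MAGIC:
                return "parquet"
    return "unknown"


if __name__ == "__main__":
    # Same placeholder path as in the question.
    print(classify_file("some/path/to/my/windows_parquet_file.parquet"))
```

If this reports `git-lfs-pointer` on the agent, the real content has not been fetched yet. Running `git lfs pull` in the repository, or enabling LFS on the pipeline's checkout step (`lfs: true`), should download it before the script runs.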
