Parquet file created in Windows cannot be opened in Ubuntu

I created a Parquet file on my Windows 10 computer using the following lines:

# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0

import pandas as pd

df = pd.DataFrame(dict(x=[0, 1, 2], y=[3, 4, 5]))
df.to_parquet('some/path/to/my/windows_parquet_file.parquet')

Now I'm creating a pipeline in Azure Pipelines in which I want to load that same file by executing a Python script. The agent executing the Python script runs Ubuntu 20.04.4. The contents of that script:

# pandas and pyarrow installed using pip on Python 3.9
# pip install pandas==1.4.2
# pip install pyarrow==7.0.0

import pandas as pd

parquet_file_path = 'some/path/to/my/windows_parquet_file.parquet'
df = pd.read_parquet(parquet_file_path)

However, the last line raises an error:

Traceback (most recent call last):
  File "/home/vsts/work/_temp/ec5ac2c3-4983-41d5-abe4-cd532dafb5af.py", line 4, in <module>
    df = pd.read_parquet(parquet_file_path)
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1960, in read_table
    dataset = _ParquetDatasetV2(
  File "/opt/hostedtoolcache/Python/3.9.12/x64/lib/python3.9/site-packages/pyarrow/parquet.py", line 1766, in __init__
    [fragment], schema=fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 797, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Does anyone know why this error is raised and how to solve it? I've searched the internet, but I couldn't find anything pointing to differences in writing or reading Parquet files on different operating systems.

The Python version is 3.9 on both my PC and the agent VM.

Answer:

After spending several more hours on this problem, I found the cause of the error and the solution. The issue has nothing to do with different operating systems or package versions.

The file I referred to is tracked by Git LFS. On the agent, the file was therefore not a Parquet file but a small text pointer to one, which is why pyarrow could not find the Parquet magic bytes in the footer. The solution was to make sure that any relevant files are downloaded before trying to access them. In my specific case, using Azure Pipelines, I found the solution here: How to use Git LFS with Azure Repos and Pipelines
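For anyone hitting the same error, a quick way to confirm this diagnosis is to look at the raw bytes of the file before handing it to pandas. The following is only a minimal sketch, not part of my original fix: the function name classify_file is made up for illustration, and it assumes a plain (unencrypted) Parquet file, which starts and ends with the bytes PAR1, and a Git LFS pointer file, which is a short text file starting with the LFS spec version line.

```python
import os

# Plain Parquet files begin and end with this 4-byte marker.
PARQUET_MAGIC = b"PAR1"
# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def classify_file(path):
    """Return 'lfs-pointer', 'parquet', or 'unknown' (hypothetical helper)."""
    with open(path, "rb") as f:
        # Read just enough of the start to recognise either format.
        head = f.read(len(LFS_POINTER_PREFIX))
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 4:
            # The last four bytes hold the trailing Parquet marker.
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)

    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    if size >= 8 and head[:4] == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    return "unknown"
```

Calling classify_file(parquet_file_path) right before pd.read_parquet should return 'lfs-pointer' when the LFS objects were never fetched. In that case the fix is on the checkout side (for example running git lfs pull, or enabling LFS on the pipeline's checkout step), not in the Python code.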