I tried to read a parquet (.parq) file I have stored in a GitHub project, using the following script:
import pandas as pd
import numpy as np
import ipywidgets as widgets
import datetime
from ipywidgets import interactive
from IPython.display import display, Javascript
import warnings
warnings.filterwarnings('ignore')
parquet_file = r'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
and it gave me this error:
ArrowInvalid: Could not open Parquet input source '': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Does anyone know what this error message means and how I can load the file in my GitHub repository? Thank you in advance.
CodePudding user response:
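You should use the URL under the domain raw.githubusercontent.com. The github.com/.../blob/... URL returns the HTML page that displays the file, not the file's bytes, which is why the reader cannot find the Parquet magic bytes.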
As for your example:
parquet_file = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
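If you have several blob links to convert, the rewrite can be scripted. The following is only a sketch: to_raw_url is a hypothetical helper name (not part of pandas or any GitHub library), and it assumes the usual https://github.com/<owner>/<repo>/blob/<ref>/<path> layout with a branch name that contains no slash.

```python
from urllib.parse import urlsplit


def to_raw_url(blob_url: str) -> str:
    """Turn a github.com 'blob' page URL into a raw.githubusercontent.com URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip('/').split('/')
    # Expected shape: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != 'github.com' or len(segments) < 5 or segments[2] != 'blob':
        raise ValueError(f'not a GitHub blob URL: {blob_url}')
    owner, repo, _blob, *rest = segments
    # Drop the 'blob' segment and switch to the raw content host.
    return 'https://raw.githubusercontent.com/' + '/'.join([owner, repo, *rest])


blob = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'
print(to_raw_url(blob))
```

The result can then be passed to pd.read_parquet as in the snippet above.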
CodePudding user response:
You can read parquet files directly from a web URL like this. However, when reading a data file from a git repository, you need to make sure it is the raw file URL:
url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq?raw=true'
df = pd.read_parquet(url)
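To see what the error in the question is complaining about, you can check the magic bytes yourself before handing the data to pandas. A Parquet file starts and ends with the 4 bytes PAR1, while an HTML page from the blob URL does not. This is a rough sketch using only the standard library; looks_like_parquet and fetch_parquet_bytes are hypothetical names chosen for this example.

```python
import urllib.request

PARQUET_MAGIC = b'PAR1'


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the payload has the Parquet magic bytes at both ends."""
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    return (
        len(data) >= 12
        and data.startswith(PARQUET_MAGIC)
        and data.endswith(PARQUET_MAGIC)
    )


def fetch_parquet_bytes(url: str) -> bytes:
    """Download a URL and fail early if the payload is not a Parquet file."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_parquet(data):
        raise ValueError(
            f'{url} did not return a Parquet file; '
            'is it a GitHub page URL instead of a raw file URL?'
        )
    return data
```

The returned bytes can be loaded with pd.read_parquet(io.BytesIO(data)), assuming io and pandas are imported.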