Loading a parquet file from a GitHub repository

I tried to read a parquet (.parq) file I have stored in a GitHub project, using the following script:

import pandas as pd
import numpy as np
import ipywidgets as widgets
import datetime
from ipywidgets import interactive
from IPython.display import display, Javascript

import warnings
warnings.filterwarnings('ignore')


parquet_file = r'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'

df = pd.read_parquet(parquet_file, engine='auto')

and it gave me this error:

ArrowInvalid: Could not open Parquet input source '': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

Does anyone know what this error message means and how I can load the file in my GitHub repository? Thank you in advance.

CodePudding user response：

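You should use the URL on the raw.githubusercontent.com domain. The github.com/.../blob/... address returns the HTML page that displays the file, not the file's bytes, which is why the reader cannot find the Parquet magic bytes.

For your example: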

parquet_file = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
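If you have several such links, the rewrite from the blob URL to the raw URL can be done in code. This is only a minimal sketch: to_raw_url is a hypothetical helper name, not part of pandas or any GitHub library, and it assumes the usual github.com/<owner>/<repo>/blob/<branch>/<path> layout.

```python
from urllib.parse import urlsplit


def to_raw_url(url: str) -> str:
    """Rewrite a github.com 'blob' page URL to its raw.githubusercontent.com form.

    Hypothetical helper for illustration. URLs that do not match the
    expected blob layout are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    owner, repo, _blob, *rest = segments
    # The raw URL is the same path with the 'blob' segment dropped.
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


blob_url = "https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq"
print(to_raw_url(blob_url))
```

The result can then be passed to pd.read_parquet as in the two lines above.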

CodePudding user response：

pandas can read a parquet file directly from a web URL, but the URL has to return the file itself. For a file in a GitHub repository that means using the raw file URL; appending ?raw=true to the blob URL makes GitHub redirect to it:

url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq?raw=true'
df = pd.read_parquet(url, engine='auto')
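As for what the error means: a Parquet file starts and ends with the 4-byte marker PAR1, and the reader complains when it does not find that marker at the end of what it downloaded. The sketch below checks for the marker before handing the bytes to pandas. The names looks_like_parquet and read_parquet_from_url are hypothetical helpers introduced here for illustration, and downloading the whole file into memory is an assumption that only suits small files.

```python
import io
import urllib.request

PARQUET_MAGIC = b"PAR1"  # marker at both the start and the end of a Parquet file


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes carry the Parquet marker at both ends."""
    # Smallest plausible layout: 4-byte header marker, 4-byte footer
    # length, 4-byte trailing marker.
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def read_parquet_from_url(url: str):
    """Download url and parse it as Parquet, failing early on non-Parquet content."""
    import pandas as pd  # imported here so the check above works without pandas

    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not looks_like_parquet(data):
        # Typical cause: the URL returned an HTML page instead of the file.
        raise ValueError(f"{url} did not return Parquet data; first bytes: {data[:20]!r}")
    return pd.read_parquet(io.BytesIO(data))
```

Calling read_parquet_from_url with the original blob URL should raise the ValueError and show the start of an HTML page, while the raw URL should return the DataFrame.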