Read multiple parquet files to pandas with select columns where select columns exist

When I run the code below, I hit an error because some of the files are missing the required columns:


import pandas as pd

li = []

for filename in parquet_filtered_list:
    # Raises an error if any requested column is absent from the file
    df = pd.read_parquet(filename, columns=list_key_cols_aggregates)
    li.append(df)

df_raw_2021_to_2022 = pd.concat(li, axis=0, ignore_index=False)
del li

How do I skip a file if it is missing the required columns?

CodePudding user response:

pyarrow.parquet.read_schema reads a parquet file's schema from the file metadata without loading the row data, so we can check the column names before reading the file into a DataFrame:

import pandas as pd
import pyarrow.parquet as pq

# Keep only the files whose schema contains every required column
df_raw_2021_to_2022 = pd.concat([pd.read_parquet(fname, columns=list_key_cols_aggregates)
                                 for fname in parquet_filtered_list
                                 if set(pq.read_schema(fname).names).issuperset(list_key_cols_aggregates)],
                                axis=0, ignore_index=False)

The list of column names in the parquet file is available as schema.names.