When running the code below, I hit an error because some of the files are missing the required columns:
li = []
for filename in parquet_filtered_list:
    df = pd.read_parquet(filename, columns=list_key_cols_aggregates)
    li.append(df)

df_raw_2021_to_2022 = pd.concat(li, axis=0, ignore_index=False)
del li
How do I skip a file if it is missing the required columns?
CodePudding user response:
You can read a Parquet file's schema and metadata with pyarrow.parquet.read_schema before loading it into a DataFrame:
import pyarrow.parquet as pq

df_raw_2021_to_2022 = pd.concat(
    [pd.read_parquet(fname, columns=list_key_cols_aggregates)
     for fname in parquet_filtered_list
     # keep only files whose schema contains every required column
     if set(pq.read_schema(fname).names).issuperset(set(list_key_cols_aggregates))],
    axis=0, ignore_index=False)
The Parquet file's column names are available via schema.names.
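If you also want to see which files get skipped and which columns they are missing, the same schema check can be written as an explicit loop. This is just a sketch of the approach above; it assumes the same parquet_filtered_list and list_key_cols_aggregates variables from the question.

import pandas as pd
import pyarrow.parquet as pq

required = set(list_key_cols_aggregates)
frames = []
for fname in parquet_filtered_list:
    # read_schema only reads the file footer, not the data itself
    missing = required - set(pq.read_schema(fname).names)
    if missing:
        print(f"Skipping {fname}: missing columns {sorted(missing)}")
        continue
    frames.append(pd.read_parquet(fname, columns=list_key_cols_aggregates))

df_raw_2021_to_2022 = pd.concat(frames, axis=0, ignore_index=False)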