Retrieving data from multiple parquet files into one dataframe (Python)-CodePudding

I want to start by saying this is the first time I work with Parquet files. I have a list of 2615 parquet files that I downloaded from an S3 bucket and I want to read them into one dataframe. They follow the same folder structure and I am putting an example below:

/Forecasting/as_of_date=2022-02-01/type=full/export_country=Spain/import_country=France/000.parquet'

The file name 000.parquet is always the same, irrespective of folder.

I saved all of the file locations using the following function:

import os
def list_files(dir):
    r = []
    for root, dirs, files in os.walk(dir):
        for name in files:
            r.append(os.path.join(root, name))
    return r

This generates a list of all file locations, exactly like in the folder example above.

The next thing I tried was using DASK to read all of the parquet files into a dask dataframe but it doesn't seem to work.

import dask.dataframe as dd
dask_df = dd.read_parquet(data_files)

I keep getting this error and I'm not sure how to fix it, although I understand where the issue is. It's because the files contain the columns export_country and import_country, which are also partitions:

ValueError: No partition-columns should be written in the 
file unless they are ALL written in the file.

Another solution I tried using was iterating through each parquet file using pandas and combining everything into one dataframe.

df = pd.DataFrame()
for f in data_files:
    data = pd.read_parquet(f,engine = 'pyarrow')
    df = df.append(data)

This seems to take ages and my kernel dies due to no more RAM available.

CodePudding user response：

It's faster to do a single concat compared to append multiple times:

df = pd.concat((pd.read_parquet(f, engine = 'pyarrow') for f in data_files))

but I doubt it helps with the memory limitation.

CodePudding user response：

I would also agree to RAM limitations with so many files ... One "dirty" solution could be concat chunks of data_files and not all together.

part1 = data_files[:10]
part2 = list[10:20]
part3 = list[20:30]
part4 = list[30:40]

I case you have enough memory on harddisk you can try storing them, just append to one "big file" but I guess in the case loading this file it will run also in out-of-memory.

In cloud there a solutions for such big amount of data and clusters and tools to analyse them. Just use one of those tools for data analysis. :)

CodePudding user response：

A variation of @Learning is a mess's answer, but using dd.concat:

from dask.dataframe import read_parquet, concat
dask_df = concat([read_parquet(f) for f in data_files])