Reading and processing multiple csv files with limited RAM in Python

I need to read thousands of csv files and output them as a single csv file in Python.

Each of the original files will be used to create a single row in the final output, with the columns being some operation on the rows of that original file.

Due to the combined size of the files, processing takes many hours, and the data cannot all be loaded into memory at once.

I am able to read in each csv and delete it from memory to work around the RAM issue. However, I am currently reading and processing each csv one at a time (in Pandas) and appending the output row to the final csv, which seems slow. I believe I could use the multiprocessing library to have each process read and process its own csv, but wasn't sure whether there is a better approach than this.
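Roughly, the multiprocessing version would look something like the sketch below (process_file and the mean/max aggregation are only placeholders for the real per-file processing):

import glob
import multiprocessing as mp
import pandas as pd

def process_file(path):
    # Read one csv and collapse it into a single output row
    df = pd.read_csv(path)
    return {'id': df['id'].iloc[0],
            'col1_avg': df['col1'].mean(),
            'col2_max': df['col2'].max()}

if __name__ == '__main__':
    files = glob.glob('*.csv')
    with mp.Pool() as pool:
        rows = pool.map(process_file, files)
    pd.DataFrame(rows).to_csv('output.csv', index=False)

Each worker only holds one file in memory at a time, and the collected result is just one small row per file.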

What is the fastest way to do this in Python given the RAM limitations?

As an example, ABC.csv and DEF.csv would be read and processed into individual rows in the final output csv. (The actual files would have tens of columns and hundreds of thousands of rows)

ABC.csv:

id,col1,col2
abc,2.3,3
abc,3.7,5
abc,3.0,9

DEF.csv:

id,col1,col2
def,1.9,3
def,2.8,2
def,1.6,1

Final Output:

id,col1_avg,col2_max
abc,3.0,9
def,2.1,3

CodePudding user response:

I would suggest using dask for this. It is a library for parallel, out-of-core processing of datasets that are too large to fit in memory.

import dask.dataframe as dd

# Lazily read every csv in the current directory into one dask dataframe
df = dd.read_csv('*.csv')

# One output row per id: mean of col1, max of col2, renamed to match the desired output
df = df.groupby('id').agg({'col1': 'mean', 'col2': 'max'})
df = df.rename(columns={'col1': 'col1_avg', 'col2': 'col2_max'})
# single_file=True writes a single csv instead of one file per partition
df.to_csv('output.csv', single_file=True)

Code explanation

dd.read_csv('*.csv') matches every csv file in the current directory and combines them into a single dask dataframe. The files are read lazily in chunks, so the full dataset never has to be loaded into memory at once.

df.groupby('id').agg({'col1': 'mean', 'col2': 'max'}) groups the dataframe by the id column and calculates the mean of col1 and the max of col2 for each group. Since each input file contains a single id, this yields one output row per file; the rename call then gives the columns the col1_avg and col2_max names from the desired output.

df.to_csv('output.csv', single_file=True) triggers the computation and writes the result to a single csv file. Without single_file=True, dask writes one file per partition.
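If memory is still tight, dd.read_csv also accepts a blocksize argument that controls how much of each file goes into a single partition; for example (the '64MB' value is just an illustrative choice):

import dask.dataframe as dd

# Smaller partitions keep peak memory per worker lower, at the cost of more tasks
df = dd.read_csv('*.csv', blocksize='64MB')

Smaller partitions reduce peak memory per worker, so you can tune this down if workers run out of RAM.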

Performance

I tested this on my machine with a directory containing 10,000 csv files with 10,000 rows each. The code took about 2 minutes to run.

Installation

To install dask with dataframe support, run pip install "dask[dataframe]".
