Combine big data stored in subdirectories as 100,000 CSV files totaling 200 GB with Python


I want to create an algorithm to extract data from CSV files in different folders/subfolders. Each folder will have 9,000 CSVs, and there will be 12 such folders: 12 * 9,000, over 100,000 files in total.

CodePudding user response:

You can read CSV files using pandas and store them space-efficiently on disk:

import pandas as pd

file = "your_file.csv"
data = pd.read_csv(file)                  # read one CSV into a DataFrame
data = data.astype({"column1": int})      # cast columns to the types they actually hold
data.to_hdf("new_filename.hdf", "key")    # store the DataFrame space-efficiently as HDF5

Depending on the contents of your file, you can make adjustments to read_csv as described here:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Make sure that after you've read your data in as a dataframe, the column types match the types they are holding. This way you can save a lot of storage in memory and later when saving these dataframes to disk. You can use astype to make these adjustments. After you've done that, store your dataframe to disk with to_hdf.
If your data is compatible across CSV files, you can append the dataframes onto each other into one larger dataframe, as sketched below.
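
A minimal sketch of that idea, assuming all files share the same columns; the glob pattern, column name and dtypes are placeholders you would adapt to your data:

import glob
import pandas as pd

# Hypothetical input layout; adjust the pattern to your folder structure.
files = glob.glob("csv_test/data/**/*.csv", recursive=True)

for f in files:
    df = pd.read_csv(f)
    df = df.astype({"column1": int})      # adjust dtypes per column as needed
    # Append each file to one HDF5 store; format="table" is required for appending.
    df.to_hdf("combined.hdf", key="data", mode="a", format="table", append=True)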

CodePudding user response:

If the files have consistent structure (column names and column order), then dask can create a large lazy representation of the data:

from dask.dataframe import read_csv

ddf = read_csv('my_path/*/file_*.csv')

# do something
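
As a follow-up sketch (the paths are placeholders, and to_parquet requires pyarrow or fastparquet to be installed), two common ways to "do something" with the lazy dataframe:

from dask.dataframe import read_csv

ddf = read_csv('my_path/*/file_*.csv')

# Write the combined data to a columnar format without loading
# all 200 GB into memory at once (one output file per partition).
ddf.to_parquet('my_path/combined_parquet/')

# Or trigger computation for a small result that fits in memory,
# e.g. the total number of rows.
print(len(ddf))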

CodePudding user response:

This is a working solution for over 100,000 files:

Credits : Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934

import pandas as pd
import glob
import time

start = time.time()

path = 'csv_test/data/'
all_files = glob.glob(path + "*.csv")    # collect every CSV in the folder

l = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    l.append(df)

# stack all dataframes into one and write a single output CSV
frame = pd.concat(l, axis=0, ignore_index=True)
frame.to_csv('output.csv', index=False)

end = time.time()
print(end - start)

I'm not sure whether it can handle 200 GB of data, since it holds all dataframes in memory before concatenating. Feedback on this would be welcome.
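
A memory-friendlier variant of the same loop (a sketch, assuming all files share the same header) streams each file straight into the output CSV instead of holding 100,000 dataframes in RAM:

import glob
import pandas as pd

path = 'csv_test/data/'
all_files = glob.glob(path + "*.csv")

for i, filename in enumerate(all_files):
    df = pd.read_csv(filename, index_col=None, header=0)
    # Write the header only for the first file, then append the rest.
    df.to_csv('output.csv', mode='w' if i == 0 else 'a',
              header=(i == 0), index=False)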
