I have a folder /data/csvs which contains ~7000 CSV files, each with ~600 lines. Each CSV has a name containing a timestamp that needs to be preserved, e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21, today's date).
I need to:
- Load each CSV.
- Add a column in which the timestamp is saved so I know which file the data came from. The time increases by half a second each row, so this column also needs to show the hour/minute/second/microseconds.
- Combine the rows into one table which will span a month of data.
- Ideally I'd like the final product to be a DataFrame.
Currently this is what I'm doing:
import os
import pandas as pd

csv_folder_path = '/data/csvs'
# timestamp() and processed_path (the output feather file) are defined elsewhere

files = os.listdir(csv_folder_path)
csv_names = []
for file_name in files:
    if file_name[-4:] == '.csv':
        csv_names.append(file_name)

to_process = len(csv_names)
for i, csv_name in enumerate(csv_names):
    df = pd.read_csv(f'{csv_folder_path}/{csv_name}')  # csv_name, not file_name, is the current file
    df = timestamp(df, csv_name)  # adds the timestamp column derived from the file name
    to_process -= 1
    if i == 0:
        concat_df = df
        concat_df.to_feather(path=processed_path)
    else:
        concat_df = pd.concat([concat_df, df])
    if to_process % 100 == 0:
        # every 100 files, merge with what has already been written and re-save
        saved_df = pd.read_feather(path=processed_path)
        concat_df = pd.concat([saved_df, concat_df])
        concat_df.reset_index(drop=True, inplace=True)
        concat_df.to_feather(path=processed_path)
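For context, timestamp() boils down to something like this (a simplified sketch; the file stem is DDMMYY and each row is half a second after the previous one):

import numpy as np
import pandas as pd

def timestamp(df, csv_name):
    # The file stem (e.g. "261121") gives the date in DDMMYY form
    start = pd.to_datetime(csv_name[:-4], format='%d%m%y')
    # Rows are half a second apart, so offset each row from that date
    df['timestamp'] = start + pd.to_timedelta(np.arange(len(df)) * 0.5, unit='s')
    return df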
I'm loading each CSV as a DataFrame, adding the timestamp column, and concatenating the CSVs 100 at a time (because I thought this would reduce memory usage), then appending each batch of 100 to a large feather file. This is really slow and uses loads of memory.
What is a more efficient way of doing this?
CodePudding user response:
First, you could be more efficient loading your files using glob. This saves you iterating over all the files and checking whether each file's extension is ".csv":
import glob
import os

src = '/data/csvs'
# iglob yields matching paths lazily instead of building a full list
files = glob.iglob(os.path.join(src, "*.csv"))
Then, read each file into a DataFrame inside a generator expression, in the same step assigning the basename of the file (without the extension) to a column named timestamp:
df_from_each_file = (pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0]) for f in files)
And finally, concatenate the dfs into one:
csv_data = pd.concat(df_from_each_file, ignore_index=True)
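One thing to note: the question also needs the time within each file (rows are half a second apart), not just the date from the file name. Here is a minimal sketch of how you could build a full per-row datetime in the same step, assuming the file stem encodes DDMMYY (read_with_timestamps is just an illustrative name):

import glob
import os

import numpy as np
import pandas as pd

def read_with_timestamps(path):
    # Read one CSV and add a per-row datetime column: the DDMMYY file stem
    # gives the date, and each row advances by half a second.
    df = pd.read_csv(path)
    start = pd.to_datetime(os.path.basename(path).split('.')[0], format='%d%m%y')
    df['timestamp'] = start + pd.to_timedelta(np.arange(len(df)) * 0.5, unit='s')
    return df

csv_data = pd.concat(
    (read_with_timestamps(f) for f in glob.iglob(os.path.join(src, '*.csv'))),
    ignore_index=True,
)

You can then write the combined frame to feather once at the end (csv_data.to_feather(processed_path)) instead of re-reading and re-writing it inside the loop.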
Hope this helped! I have used a process like this for large amounts of data and found it efficient enough.