Merge multiple large csv files quickly

Time:04-23

I have multiple large csv files. Each file has a size of 1 GB to 7 GB.

File 1 has this structure.

ID, col1, col2, col3

1   23      22     27
2   67      29     22
3   34      34     23

File 2 has this structure.

ID, col1, col2, col3

4   23      22     27
5   67      29     22
6   34      34     23

i.e., the ID is unique across all files.

I would like to merge these files into a single CSV file and then sort the rows by the ID column. The resulting file will be around 75 GB, starting with ID 1.

At the moment I'm doing it like this.

import pandas as pd

CHUNK_SIZE = 10000000  # number of rows per chunk
output_file = 'combined.csv'

write_header = True
for csv_file_name in sorted_fnames:
    chunk_container = pd.read_csv(csv_file_name, chunksize=CHUNK_SIZE)
    print(csv_file_name)
    for chunk in chunk_container:
        # Append each chunk; write the header row only once overall
        chunk.to_csv(output_file, mode="a", index=False, header=write_header)
        write_header = False

And then I'm sorting the file like this.

sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv

However, the merging process is incredibly slow; it takes more than an hour just to merge the files.

Note: I have only 16 GB RAM. That's why I'm using the chunking option.

Is there any faster solution available?

Thanks

CodePudding user response:

On second thought, you can use the HDF5 format, which handles big data really well:

import pandas as pd

hdf_path = '_combined.h5'

with pd.HDFStore(hdf_path, mode='w', complevel=5, complib='blosc') as store:
    for csv_file_name in sorted_fnames:
        # Read each CSV in chunks so a 7 GB file never has to fit in RAM at once
        for chunk in pd.read_csv(csv_file_name, chunksize=1000000):
            store.append('data', chunk, index=False)

You can eventually save it back to CSV if you want, but working with HDF5 would be more efficient.
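If you do export back to CSV, a chunked sketch like the following keeps memory bounded (the `hdf_to_csv` name and chunk size are illustrative; `'data'` matches the key used above):

```python
import pandas as pd

def hdf_to_csv(hdf_path, csv_out, key='data', chunksize=1_000_000):
    # Stream the HDF5 table back out in bounded-size chunks so memory
    # use stays flat even for a very large dataset.
    first = True
    for chunk in pd.read_hdf(hdf_path, key, chunksize=chunksize):
        chunk.to_csv(csv_out, mode='w' if first else 'a',
                     header=first, index=False)
        first = False
```

This requires the store to be in table format, which `store.append` produces.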

CodePudding user response:

Found a fast solution. The files are processed in a few minutes instead of hours.

The following assumes you don't have a header row in any of the CSV files. If you do, you need to remove it first. [Note: no fix is needed if only the first CSV file has a header row.]
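If your files do have header rows, a stdlib-only sketch like this could strip them first (`strip_header` is a hypothetical helper name):

```python
def strip_header(src, dst, chunk_bytes=1 << 20):
    # Copy src to dst, skipping only the first line (the header),
    # streaming in 1 MB blocks so large files never load into RAM.
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        fin.readline()  # consume and discard the header row
        while True:
            block = fin.read(chunk_bytes)
            if not block:
                break
            fout.write(block)
```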

import subprocess
sorted_fnames = ["1.csv",
                 "2.csv",
                 "3.csv"]

my_cmd = ['cat'] + sorted_fnames
with open('combined.csv', "w") as outfile:
    subprocess.run(my_cmd, stdout=outfile)
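If `cat` isn't available (e.g. on Windows), roughly the same streaming concatenation can be done in pure Python with `shutil.copyfileobj` (the `concat_files` name and buffer size are illustrative):

```python
import shutil

def concat_files(fnames, out_path, bufsize=1 << 24):
    # Pure-Python stand-in for `cat`: stream each input into the output
    # with a large buffer, never loading a whole file into RAM.
    with open(out_path, 'wb') as outfile:
        for fname in fnames:
            with open(fname, 'rb') as infile:
                shutil.copyfileobj(infile, outfile, length=bufsize)
```

Called as `concat_files(sorted_fnames, 'combined.csv')`, it assumes the same headerless inputs as the `cat` approach.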

If you want to sort, you can then use the sort command.

sort --parallel=2 -t, -k1,1 -n combined.csv > combined_sorted.csv