Fast multi-part file merge on Linux


I have 6,369 files of 256 MB each (1.63 TB total) stored in a RAM disk volume on a Linux server equipped with 4 TB of RAM. I need to merge them into a single file stored in the same RAM disk. What kind of merge operation would give me the best performance? If more RAM is needed, I can store the original parts on a 1.9 TB NVMe drive. The server has 128 cores.

Notes:

  • Files are already compressed
  • We do not have any limitations regarding available RAM or NVMe

CodePudding user response:

Given that these files are named so that they sort in the right order (such as continuous zero-padded numbering or a sortable date format), cat should do the trick from the shell prompt:

cat single*.dat > combined.dat
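As a quick sanity check afterwards (a sketch assuming GNU coreutils), the merged file should be exactly as large as the sum of the parts:

du -cb single*.dat | tail -n 1    # total apparent size of the parts, in bytes (GNU du)
stat -c '%s' combined.dat         # size of the merged file, in bytes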

You may want to make sure that glob sorting is not an issue in your particular shell: https://unix.stackexchange.com/questions/368318/does-the-bash-star-wildcard-always-produce-an-ascending-sorted-list
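If the parts are numbered without zero-padding (single1.dat, single2.dat, ..., single10.dat), the glob's lexicographic order will interleave them. A minimal sketch that forces a natural (version) sort instead, assuming GNU sort and xargs:

# Expand the glob via the shell builtin printf, sort numerically aware, then stream through cat.
printf '%s\0' single*.dat | sort -zV | xargs -0 cat > combined.dat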

Other than that, the number of input files should not matter when the glob is expanded directly on the command line (rather than built up in a script), but you should still check the limits of your setup beforehand: https://unix.stackexchange.com/questions/356386/is-there-a-maximum-to-bash-file-name-expansion-globbing-and-if-so-what-is-it
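To check those limits on your system (a sketch assuming Linux and bash), you can compare the kernel's argument-list ceiling with the size of the expanded glob:

getconf ARG_MAX                    # kernel limit on the argument list passed to a single command
printf '%s\0' single*.dat | wc -c  # approximate size of the expanded file-name list, in bytes
                                   # (printf is a shell builtin, so it is not itself subject to the limit)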

CodePudding user response:

It sounds like memory is not a limitation for you, so you should just do what Lecraminos suggested.

If space does become an issue, you can compress the (hopefully temporary) destination as you write it (though, since the parts are already compressed, the additional savings may be small):

cat single*.dat | gzip > combined.dat.gz

or iterate over the files and remove each one from the (hopefully temporary) storage as soon as it has been consumed:

for file in single*.dat; do
  cat "$file"    # append this part to the combined output
  rm -f "$file"  # free its space immediately afterwards
done > combined.dat

or both...
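Combining the two, a sketch that compresses on the fly while freeing each part as soon as it has been read:

for file in single*.dat; do
  cat "$file"    # stream this part into the pipeline
  rm -f "$file"  # then reclaim its space
done | gzip > combined.dat.gz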
