I am trying to split large CSV files into smaller CSV files of 125MB to 1GB. The split command will work if I give it a number of records per file, but I want to derive that row count dynamically based on the file size. If the file size is 20GB, loading the whole file into a Redshift table using the COPY command takes a lot of time, so if I chunk the 20GB file into files of the sizes mentioned above, I get good results.
For example, a 20GB file could be split at 6_000_000 records per file so that each chunk file ends up around 125MB; I want that 6_000_000 row count computed dynamically depending on the size.
CodePudding user response:
You can get the file size in MB and divide by some ideal size that you need to predetermine (for my example I picked your minimum of 125MB), and that will give you the number of chunks.
You then get the row count (wc -l, assuming your CSV has no line breaks inside a cell) and divide that by the number of chunks to give your rows per chunk.
Rows per chunk is your "lines per chunk" count that you can finally pass to split.
Because we are doing integer division, which will most likely leave a remainder, you'll probably get one extra file holding relatively few remainder rows (you can see this in the example below).
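To make that arithmetic concrete with the 20GB figure from the question (both numbers below are hypothetical: 20480MB for the file size, and roughly 983_000_000 rows extrapolated from the asker's 6_000_000-rows-per-125MB estimate):
# 20480MB file / 125MB target = number of chunks
echo $(( 20480 / 125 ))        # prints 163
# ~983M total rows / 163 chunks = rows per chunk
echo $(( 983000000 / 163 ))    # prints 6030674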
Here's how I coded this up. I'm using shellcheck, so I think this is pretty POSIX-compliant:
#!/bin/sh
csvFile=$1
maxSizeMB=125

rm -f chunked_*

# File size in MB (du -m reports 1MB blocks)
fSizeMB=$(du -ms "$csvFile" | cut -f1)
echo "File size is ${fSizeMB}MB, max size per new file is ${maxSizeMB}MB"

nChunks=$(( fSizeMB / maxSizeMB ))
[ "$nChunks" -lt 1 ] && nChunks=1  # guard: file smaller than one chunk
echo "Want $nChunks chunks"

# Redirecting into wc prints only the count; tr strips the padding BSD wc adds
nRows=$(wc -l < "$csvFile" | tr -d '[:blank:]')
echo "File row count is $nRows"

nRowsPerChunk=$(( nRows / nChunks ))
echo "Need $nChunks files at around $nRowsPerChunk rows per file (plus one more file, maybe, for remainder)"

split -d -a 4 -l "$nRowsPerChunk" "$csvFile" "chunked_"

echo "Row (line) counts per file:"
wc -l chunked_00*
echo
echo "Size (MB) per file:"
du -ms chunked_00*
I created a mock CSV with 60_000_000 rows that is about 5GB:
ll -h gen_60000000x11.csv
-rw-r--r-- 1 zyoung staff 4.7G Jun 24 15:21 gen_60000000x11.csv
When I ran that script I got this output:
./main.sh gen_60000000x11.csv
File size is 4801MB, max size per new file is 125MB
Want 38 chunks
File row count is 60000000
Need 38 files at around 1578947 rows per file (plus one more file, maybe, for remainder)
Row (line) counts per file:
1578947 chunked_0000
1578947 chunked_0001
1578947 chunked_0002
...
1578947 chunked_0036
1578947 chunked_0037
14 chunked_0038
60000000 total
Size (MB) per file:
129 chunked_0000
129 chunked_0001
129 chunked_0002
...
129 chunked_0036
129 chunked_0037
1 chunked_0038
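As a quick sanity check (my addition, not part of the run above), the chunks should concatenate back to the original byte-for-byte:
cat chunked_00* | cmp - gen_60000000x11.csv && echo "chunks match original"
And if you'd rather target a byte size directly, skipping the chunk-count step entirely (which also sidesteps the case where the file is smaller than maxSizeMB), here's a minimal sketch of a variant; the byte arithmetic is mine and assumes a shell with 64-bit integers for the intermediate multiplication:
#!/bin/sh
csvFile=$1
targetBytes=$(( 125 * 1024 * 1024 ))  # 125MB per chunk

fSizeBytes=$(wc -c < "$csvFile")      # exact size in bytes
nRows=$(wc -l < "$csvFile")

# rows per chunk = target bytes / average bytes per row,
# computed as (target * rows) / total bytes to stay in integer math
nRowsPerChunk=$(( targetBytes * nRows / fSizeBytes ))
[ "$nRowsPerChunk" -lt 1 ] && nRowsPerChunk=1  # pathological: rows larger than target

split -d -a 4 -l "$nRowsPerChunk" "$csvFile" "chunked_"
This keeps every chunk near the 125MB target regardless of how large the input is, at the cost of not knowing the chunk count up front.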