Truncate file in linux with wget download active

Time:08-10

I have a background download running with wget. When the file grows larger than 20 MB, I want to truncate the first 10 MB of the file. I have created this script:

if [ $filesize -ge $maxSize ]; then
    echo "Truncate.."
    kill -STOP $pidDwn
    fallocate -c -o 0 -l 10M $fileName
    kill -CONT $pidDwn
fi

This snippet checks the file size and removes 10 MB from the beginning of the file.
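For context, the variables used in the snippet could be set up as follows. This is a sketch; the exact commands (stat, pgrep) and the pattern used to find the wget PID are assumptions, not part of the original script. A small demo file is created so the commands run as-is:

```shell
# Create a small demo file so the commands below are runnable as-is
dd if=/dev/zero of=download.bin bs=1024 count=4 2>/dev/null

fileName="download.bin"
maxSize=$((20 * 1024 * 1024))          # 20 MB threshold, in bytes
filesize=$(stat -c %s "$fileName")     # current file size in bytes
# Hypothetical way to find the wget PID (assumes one matching process):
pidDwn=$(pgrep -f "wget.*$fileName" | head -n1)

if [ "$filesize" -ge "$maxSize" ]; then
    echo "Truncate.."
fi
```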

I stop the wget process, use fallocate to delete the first 10 MB, and then resume the wget process to continue the download. The problem is strange: if the file size is 20 MB and I run fallocate WITHOUT resuming the wget process, the file stays at 10 MB, but as soon as I resume the wget process the file instantly goes back to 20 MB and keeps growing with the download. If instead I run sed -i 1d $fileName after resuming the PID, the file stays at 10 MB but no longer grows; the download seems interrupted, although ps aux shows the wget process is still alive.

Any idea how to fix it?

CodePudding user response:

truncate the first 10mb of file

If the server from which you are downloading the file supports the Partial Content feature, then you can request a Range (part) of the file. To check whether the server supports it, run

wget --spider --server-response <url_to_resource>

If the output contains

Accept-Ranges: bytes

then it does, and you can request the file starting from, say, the 100th byte using the Range header:

wget --header "Range: bytes=100-" <url_to_resource>

You should then get a response with status code 206, and the download will start.
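The header check above can also be scripted. The sketch below parses a saved response instead of hitting the network (the headers string is a made-up example); with a live server you would feed it the output of wget --spider --server-response:

```shell
# Example response headers, as wget --server-response would print them
headers='HTTP/1.1 200 OK
Accept-Ranges: bytes
Content-Length: 1048576'

# Look for the Accept-Ranges header, case-insensitively
if printf '%s\n' "$headers" | grep -qi '^Accept-Ranges: bytes'; then
    echo "ranges supported"
else
    echo "ranges not supported"
fi
```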

CodePudding user response:

fallocate (or any other program) cannot change the seek position that wget is currently using for writing to the file, so you can't do it that way.

A possible work-around would be to use tail for getting rid of all but the last 20MiB of the file:

wget 'https://somewhere/somefile.pdf' -O - |
tail -c "$((20 * 1024 * 1024))" > last20MiB.out
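To see why this pipeline works, here is a tiny demonstration that tail -c N keeps only the last N bytes of its input; the sizes are miniature stand-ins for the 20 MiB used above:

```shell
# 10-byte input file
printf 'abcdefghij' > /tmp/sample.bin

# Keep only the last 4 bytes of the stream
tail -c 4 /tmp/sample.bin > /tmp/last4.bin

cat /tmp/last4.bin   # prints "ghij"
```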

remark: Frankly, I don't know what you'll be able to do with a file whose header has been stripped.


Update

What you're trying to do isn't trivial because it requires a low-level file API, but here's a solution with perl.

The perl program writes the last 20 MiB of the stream to a file for every 10 MiB of downloaded data; the first write happens after 30 MiB of input data.

curl 'https://somewhere/somefile.pdf' |
perl -e '
    use bytes;
    $max_file_size = 20 * 1024 * 1024; #=> 20 MiB
    $buffer = "";
    $buffer_size = 0;
    $pending_bytes = 0;
    # read the stream 1 MiB at a time
    while ($bytes = read(STDIN, $data, 1048576)) {
        $buffer .= $data;
        $buffer_size += $bytes;
        $pending_bytes = 1;
        if ($buffer_size >= 1.5 * $max_file_size) {
            # keep only the last 20 MiB of the buffer and flush it to the file
            $buffer = substr $buffer, ($buffer_size - $max_file_size), $max_file_size;
            $buffer_size = $max_file_size;
            open(FH, ">:raw", $ARGV[0]);
            print FH $buffer;
            close(FH);
            $pending_bytes = 0;
        }
    }
    # flush whatever is left once the stream ends
    if ($pending_bytes) {
        open(FH, ">:raw", $ARGV[0]);
        print FH substr($buffer, ($buffer_size > $max_file_size ? $buffer_size - $max_file_size : 0), $max_file_size);
        close(FH);
    }
' last20MiB.out

remark: I tried using truncate FH, 0 instead of opening/closing the output file for each write, but it doesn't work...
