I've got a file that contains a list of file paths. I’m downloading them like this with wget:
wget -i cram_download_list.txt
However, the list is long and my session gets interrupted. I'd like to check the directory to see which files already exist, and only download the outstanding ones.
I've been trying to come up with an option involving comm, but can't work out how to loop it in with wget.
File contents look like this:
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239280/NA07037.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239286/NA11829.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239293/NA11918.final.cram
ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239298/NA11994.final.cram
I’m currently trying to do something like this:
ls *.cram | sed 's/^/ftp:\/\/ftp.sra.ebi.ac.uk\/vol1\/run\/ERR323\/ERR3239480\//' > downloaded.txt
comm -3 <(sort cram_download_list.txt) <(sort downloaded.txt) | tr -d " \t" > to_download.txt
wget -i to_download.txt
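As an aside, the comm step can be checked in isolation with made-up URLs (hypothetical names, not the real list); comm -23 prints only the lines unique to the first sorted input, which is exactly the not-yet-downloaded set:

```shell
# Hypothetical URLs for illustration only.
printf '%s\n' 'ftp://host/a.cram' 'ftp://host/b.cram' 'ftp://host/c.cram' | sort > all_urls.txt
printf '%s\n' 'ftp://host/b.cram' | sort > done_urls.txt

# -23 suppresses lines unique to done_urls.txt (-2) and common lines (-3),
# leaving only the URLs not yet downloaded.
comm -23 all_urls.txt done_urls.txt > to_download.txt
cat to_download.txt
```

Unlike comm -3, the -23 form never emits lines from the second file, so no tab-stripping with tr is needed afterwards.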
CodePudding user response:
wget -c -i <(find . -type f -name '*.cram' -printf '%f$\n' |
    grep -vf - cram_download_list.txt)
find prints the name of each existing *.cram file, followed by a $ and a newline. That output is used as an inverted regex match list for grep (-v -f -): any line in cram_download_list.txt that ends in one of the existing file names is removed, leaving only the files still to download.
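The grep filtering can be sketched stand-alone with hypothetical file and host names, without running find or wget:

```shell
# Hypothetical download list (made-up host and run directories).
printf '%s\n' \
  'ftp://host/run1/NA07037.final.cram' \
  'ftp://host/run2/NA11829.final.cram' > list.txt

# Pretend NA07037.final.cram already exists locally: its name plus a
# trailing $ anchor becomes an exclusion pattern, so only URLs whose
# last path component is still missing survive.
printf 'NA07037.final.cram$\n' | grep -vf - list.txt > remaining.txt
cat remaining.txt
```

Because the pattern is anchored with $, it only removes lines that end in that file name; the unescaped dots in the name can in principle match other characters, but for names like these that is harmless.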
Added -c to finalize incomplete files (i.e. resume interrupted downloads).
Note: this does not handle spaces or newlines in file names well, but these are FTP URLs, so that should not be a problem in the first place.
CodePudding user response:
If you also want to handle partially transferred files, you need to pass wget the complete set of filenames so that it can check the length of each one. For this scenario that means simply:
wget -c -i cram_download_list.txt
Files that are already complete will only be checked and skipped; partial files will be resumed where they left off.