I am trying to download more than 300k HTML files from the same server. I have the URLs in a list/text file. My first attempt was to use Python's urllib/requests, but it was incredibly slow and would get stuck after a few links (10-20). Code example:
import urllib.request

for i, url in enumerate(url_list):
    urllib.request.urlretrieve(url, "./pages/" + str(i))
Then I tried simply using wget like this:
wget -i links_file.txt -U netscape
wget works great: it downloads between 1k and 5k files without problems and seems very fast, but then it gets stuck at seemingly random files:
Connecting to <website>... connected. HTTP request sent, awaiting response...
I can see which URL it got stuck at, and if I stop the run and start it again from the same point, it works perfectly fine for another 1-5k downloads. Since I can't do this manually every time it gets stuck until I finally have all 300k files, I was wondering: is there a way to stop wget automatically if it awaits a response for too long and then just retry? Or is there any other/better way to download this many HTML files automatically?
CodePudding user response:
is there a way to stop wget automatically if it awaits a response for too long and then just retry?
What you are looking for is called a timeout and a number of retries. In wget you can use --timeout to set all timeouts at once, or use the specific timeouts:
--dns-timeout
--connect-timeout
--read-timeout
In either case, provide the value as a number of seconds after the =, for example --timeout=60.
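For example, applied to the command from the question (the timeout values below are only illustrative, not recommendations):
wget -i links_file.txt -U netscape --dns-timeout=20 --connect-timeout=20 --read-timeout=60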
Use --tries to set the number of retries (default: 20), for example --tries=10, but keep in mind that no retry is done in case of a fatal error.
You might also find --no-clobber useful: its effect is that a file is not downloaded if a file with that name already exists locally (which would otherwise be overwritten).
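Putting it together with the command from the question (again, the values are only an illustration and can be tuned):
wget -i links_file.txt -U netscape --timeout=60 --tries=10 --no-clobber
With --no-clobber, restarting the run after an interruption skips the files that were already downloaded instead of fetching them again.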