scraping website using python or wget - awaiting response problem


I am trying to download more than 300k HTML files from the same server. I have the URLs in a list/text file. My first attempt was to use Python urllib/requests, but it was incredibly slow and would get stuck after a few links (10-20). Code example:

import urllib.request

for i, url in enumerate(url_list):
    urllib.request.urlretrieve(url, "./pages/" + str(i))
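
For reference, a version of the same loop with an explicit per-request timeout, so a stalled connection raises an error instead of hanging, would look roughly like this (just a sketch using requests; the 30-second timeout and the skip-on-error handling are assumptions, not something tested at this scale):

import requests

for i, url in enumerate(url_list):
    try:
        resp = requests.get(url, timeout=30)   # timeout in seconds (assumed value)
        resp.raise_for_status()
        with open("./pages/" + str(i), "wb") as f:
            f.write(resp.content)
    except requests.RequestException as err:
        print("failed:", url, err)             # skip this URL and move on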

Then I tried simply using wget like this:

wget -i links_file.txt -U netscape

wget works great: it downloads 1-5k files without problems and seems very fast, but then gets stuck at seemingly random files:

Connecting to <website>... connected. HTTP request sent, awaiting response...

Now I can see at which URL it got stuck, stop the run, and start it again from the same point, and it works perfectly fine again for another 1-5k downloads. Since I can't keep doing this manually every time it gets stuck until I finally have all 300k files, is there a way to stop wget automatically if it waits for a response for too long and then just retry? Or is there any other/better way to download this many HTML files automatically?

CodePudding user response:

is there a way to stop wget automatically if it waits for a response for too long and then just retry?

What you are looking for is a timeout and a number of retries. In wget you can use --timeout to set all timeouts at once, or use the specific timeouts, that is:

--dns-timeout
--connect-timeout
--read-timeout

In either case you provide the value as a number of seconds after =, for example --timeout=60.

Use --tries to set the number of retries (default: 20), for example --tries=10, but keep in mind that no retry is done in case of a fatal error.

You might also find --no-clobber useful: its effect is that a file is not downloaded if a file with the same name already exists (which would otherwise be overwritten).
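
Putting it together, a command along those lines (the timeout and retry values here are only examples) could be:

wget -i links_file.txt -U netscape --timeout=60 --tries=10 --no-clobber

With --no-clobber you can also simply rerun the same command after an interruption and already-downloaded files will be skipped.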
