Complete newbie, I know you probably can't use the variables like that in there but I have 20 minutes to deliver this so HELP
read -r -p "Month?: " month
read -r -p "Year?: " year
URL= "https://gz.blockchair.com/ethereum/blocks/"
wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit
CodePudding user response:
There are two issues with your code.
First, remove the whitespace that follows the equals sign when you declare your URL variable, so the line becomes
URL="https://gz.blockchair.com/ethereum/blocks/"
Second, you are building your URL with a wildcard, which is not supported in HTTP, so you cannot request something like month*.tsv.gz as you are doing right now. If you need to fetch several URLs, you have to run wget once for each of them.
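For example, a minimal sketch of that loop, assuming the archive publishes one file per day named blockchair_ethereum_blocks_YYYYMMDD.tsv.gz (that naming is only a guess based on your wildcard, so check the site's index first):
read -r -p "Month?: " month   # enter the month as two digits, e.g. 03
read -r -p "Year?: " year
URL="https://gz.blockchair.com/ethereum/blocks/"
# assumed daily filename pattern: one wget call per day
for day in $(seq -w 1 31); do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done
seq -w pads the day number to two digits; requests for days that don't exist in that month will simply fail with a 404.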
CodePudding user response:
It's possible to do what you're trying to do with wget; however, this particular site's robots.txt has a rule to disallow crawling of all files (https://gz.blockchair.com/robots.txt):
User-agent: *
Disallow: /
That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.
For this reason, I won't post a specific, copy/pasteable solution.
Here is a generic example for selecting (and downloading) files using a glob pattern, from a typical HTML index page:
url=https://www.example.com/path/to/index
wget \
  --wait 2 --random-wait --limit-rate=20k \
  --recursive --no-parent --level 1 \
  --no-directories \
  -A "file[0-9][0-9]" \
  "$url"
This would download all files named file with a two-digit suffix (file52 etc.) that are linked on the page at $url, and whose parent path is also $url (--no-parent). This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs, also --accept-regex and --reject-regex).
Certain sites may block the wget user agent string; it can be spoofed with --user-agent.
Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially doing it repeatedly, or not respecting robots.txt.
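For instance, the same kind of download written with the regex variant and a spoofed user agent string (the URL, the pattern and the agent string here are only placeholders):
# --accept-regex is matched against the complete URL, hence the $ anchor;
# --user-agent replaces wget's default agent string
url=https://www.example.com/path/to/index
wget \
  --wait 2 --random-wait --limit-rate=20k \
  --recursive --no-parent --level 1 \
  --no-directories \
  --accept-regex 'file[0-9][0-9]$' \
  --user-agent "Mozilla/5.0 (compatible; example-fetcher)" \
  "$url"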