Bash script execute wget with an input inside url


Complete newbie here. I know you probably can't use variables like that in there, but I have 20 minutes to deliver this, so HELP:

read -r -p "Month?: " month
read -r -p "Year?: " year

URL= "https://gz.blockchair.com/ethereum/blocks/"

wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit

CodePudding user response:

There are two issues with your code.

First, you should remove the whitespace that follows the equals sign when you declare your URL variable: with that space, bash sets URL to an empty string and tries to run the URL itself as a command. So the line becomes

URL="https://gz.blockchair.com/ethereum/blocks/"

Then, you are building your URL using a wildcard, which is not supported in HTTP. So you cannot do something like month*.tsv.gz as you are doing right now. If you need to request several URLs, you need to run wget once for each of them, for example in a loop as sketched below.
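
A minimal sketch of that loop, assuming the remote files follow a fixed YYYYMMDD naming pattern (the base URL and file name below are hypothetical placeholders, not the actual names on the site you're targeting):

read -r -p "Month?: " month
read -r -p "Year?: " year

base="https://www.example.com/files/"

# Loop over the days of the month and fetch each file explicitly.
# seq -w (GNU seq) zero-pads the day, so 1 becomes 01.
# Days that don't exist for the month will simply 404 and be skipped.
for day in $(seq -w 1 31); do
    wget -w 2 --limit-rate=20k "${base}data_${year}${month}${day}.tsv.gz"
done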

CodePudding user response:

It's possible to do what you're trying to do with wget. However, this particular site's robots.txt has a rule that disallows crawling of all files (https://gz.blockchair.com/robots.txt):

User-agent: *
Disallow: /

That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.
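
Purely as a generic illustration of the syntax (the URL below is a hypothetical placeholder), the override is passed as an inline wget setting:

# Hypothetical URL; -e robots=off makes wget ignore robots.txt entirely.
wget -e robots=off --recursive --no-parent "https://www.example.com/path/"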

For this reason, I won't post a specific, copy/pasteable solution.

Here is a generic example of selecting (and downloading) files using a glob pattern from a typical HTML index page:

url=https://www.example.com/path/to/index

wget \
    --wait 2 --random-wait --limit-rate=20k \
    --recursive --no-parent --level 1 \
    --no-directories \
    -A "file[0-9][0-9]" \
    "$url"
  • This would download all files named file with a two-digit suffix (file52, etc.) that are linked from the page at $url and that live directly under it (--no-parent stops wget from ascending into parent directories).

  • This is a recursive download that follows one level of links (--level 1). wget lets us use patterns to accept or reject filenames when recursing (-A and -R for globs; --accept-regex and --reject-regex for regular expressions); see the regex variant sketched after this list.

  • Certain sites may block the default wget user agent string; it can be spoofed with --user-agent.

  • Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially if you do it repeatedly or don't respect robots.txt.
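
The same selection can also be expressed as a regular expression over the full URL instead of a filename glob, with a custom user agent set at the same time. A minimal sketch, reusing the hypothetical $url and file pattern from the example above:

# --accept-regex matches against the complete URL, so anchor the pattern
# at the end; --user-agent replaces wget's default identification string.
wget \
    --wait 2 --random-wait --limit-rate=20k \
    --recursive --no-parent --level 1 \
    --no-directories \
    --accept-regex 'file[0-9][0-9]$' \
    --user-agent "Mozilla/5.0" \
    "$url"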
