Complete newbie, I know you probably can't use the variables like that in there but I have 20 minutes to deliver this so HELP
read -r -p "Month?: " month
read -r -p "Year?: " year
URL= "https://gz.blockchair.com/ethereum/blocks/"
wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_$year$month*.tsv.gz"
exit
CodePudding user response:
There are two issues with your code.
First, remove the whitespace that follows the equals sign when you declare your URL variable, so the line becomes
URL="https://gz.blockchair.com/ethereum/blocks/"
Second, you are building your URL with a wildcard, which is not supported in HTTP, so you cannot request something like month*.tsv.gz as you are doing right now. If you need to fetch several URLs, you have to run wget once for each of them.
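For example, a minimal sketch of that loop, assuming the archive publishes one file per day named blockchair_ethereum_blocks_YYYYMMDD.tsv.gz (that naming is only a guess based on your wildcard, so check the site's index first):
read -r -p "Month?: " month   # enter the month as two digits, e.g. 03
read -r -p "Year?: " year
URL="https://gz.blockchair.com/ethereum/blocks/"
# assumed daily filename pattern: one wget call per day
for day in $(seq -w 1 31); do
    wget -w 2 --limit-rate=20k "${URL}blockchair_ethereum_blocks_${year}${month}${day}.tsv.gz"
done
seq -w pads the day number to two digits; requests for days that don't exist in that month will simply fail with a 404.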
CodePudding user response:
It's possible to do what you're trying to do with wget; however, this particular site's robots.txt has a rule to disallow crawling of all files (https://gz.blockchair.com/robots.txt):
User-agent: *
Disallow: /
That means the site's admins don't want you to do this. wget respects robots.txt by default, but it's possible to turn that off with -e robots=off.
For this reason, I won't post a specific, copy/pasteable solution.
Here is a generic example for selecting (and downloading) files using a glob pattern, from a typical HTML index page:
url=https://www.example.com/path/to/index
wget \
  --wait 2 --random-wait --limit-rate=20k \
  --recursive --no-parent --level 1 \
  --no-directories \
  -A "file[0-9][0-9]" \
  "$url"
This would download all files named file with a two-digit suffix (file52 etc.) that are linked on the page at $url, and whose parent path is also $url (--no-parent). This is a recursive download, recursing one level of links (--level 1). wget allows us to use patterns to accept or reject filenames when recursing (-A and -R for globs, also --accept-regex and --reject-regex).
Certain sites may block the wget user agent string; it can be spoofed with --user-agent.
Note that certain sites may ban your IP (and/or add it to a blacklist) for scraping, especially doing it repeatedly, or not respecting robots.txt.
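For instance, the same kind of download written with the regex variant and a spoofed user agent string (the URL, the pattern and the agent string here are only placeholders):
# --accept-regex is matched against the complete URL, hence the $ anchor;
# --user-agent replaces wget's default agent string
url=https://www.example.com/path/to/index
wget \
  --wait 2 --random-wait --limit-rate=20k \
  --recursive --no-parent --level 1 \
  --no-directories \
  --accept-regex 'file[0-9][0-9]$' \
  --user-agent "Mozilla/5.0 (compatible; example-fetcher)" \
  "$url"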