I am trying to scrape the images from a page which contains html similar to the following:
<img
data-srcset="/content/images/size/w400/img.jpeg 400w,
/content/images/size/w750/img.jpeg 750w,
/content/images/size/w960/img.jpeg 960w," data-sizes="auto"
src="/content/images/size/w750/img.jpeg"
srcset="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
>
I am using the command wget --page-requisites mydomain.com.
It successfully downloads /content/images/size/w750/img.jpeg, but not the other two images in data-srcset.
How can I use wget to download all the images?
CodePudding user response:
With xidel:
xidel -e 'tokenize(//img/@data-srcset, "\s+\S+,") ! tokenize(.)' -s file.html |
xargs -I{} echo wget https://domain.tld{}
Output. Remove the echo command if it looks good:
wget https://domain.tld/content/images/size/w400/img.jpeg
wget https://domain.tld/content/images/size/w750/img.jpeg
wget https://domain.tld/content/images/size/w960/img.jpeg
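If xidel isn't available, a rough shell-only fallback can pull the paths straight out of the saved page with grep. This is only a sketch: it assumes the page is saved as file.html, that every image path starts with /content/ and contains no spaces, and domain.tld again stands in for the real host:

# extract every /content/....jpeg path, dedupe (src repeats one of them),
# and print the wget commands for review before running them
grep -o '/content/[^ ,"]*\.jpeg' file.html | sort -u |
xargs -I{} echo wget https://domain.tld{}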
CodePudding user response:
To get those images you'll need to actually parse that website. This can be done with the HTML parser xidel:
$ xidel -s "<url>" -e 'tokenize(//img/@data-srcset,"\s+\S+,\n?")[.]'
/content/images/size/w400/img.jpeg
/content/images/size/w750/img.jpeg
/content/images/size/w960/img.jpeg
Now just "follow" these urls (with -f / --follow) to get the content of these images.
Normally --download . would be enough, but since all 3 images are named 'img.jpeg', you have to rename them:
$ xidel -s "<url>" \
-f 'tokenize(//img/@data-srcset,"\s+\S+,\n?")[.]' \
--download 'extract($url,"w\d+")||".jpeg"'
This should download 'w400.jpeg', 'w750.jpeg' and 'w960.jpeg' in the current dir.
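The same renaming idea also works with plain wget if you feed it the paths from the tokenize step above. A minimal sketch, assuming the three paths are saved one per line in urls.txt and domain.tld is the real host:

# download each path and name the file after its width directory, e.g. w400.jpeg
while read -r path; do
  name=$(printf '%s' "$path" | grep -Eo 'w[0-9]+')
  wget -O "$name.jpeg" "https://domain.tld$path"
done < urls.txt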