I am trying to scrape the images from a page which contains html similar to the following:
<img
data-srcset="/content/images/size/w400/img.jpeg 400w,
/content/images/size/w750/img.jpeg 750w,
/content/images/size/w960/img.jpeg 960w," data-sizes="auto"
src="/content/images/size/w750/img.jpeg"
srcset="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="
>
I am using the command wget --page-requisites mydomain.com.
It successfully downloads /content/images/size/w750/img.jpeg, but not the other two images in data-srcset.
How can I use wget to download all the images?
CodePudding user response:
With xidel:
xidel -e 'tokenize(//img/@data-srcset, "\s+\S+,") ! tokenize(.)' -s file.html |
xargs -I{} echo wget https://domain.tld{}
Output. Remove the echo command if it looks good:
wget https://domain.tld/content/images/size/w400/img.jpeg
wget https://domain.tld/content/images/size/w750/img.jpeg
wget https://domain.tld/content/images/size/w960/img.jpeg
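If xidel isn't available, a rough shell-only fallback can pull the paths straight out of the saved page with grep. This is only a sketch: it assumes the page is saved as file.html, that every image path starts with /content/ and contains no spaces, and domain.tld again stands in for the real host:

# extract every /content/....jpeg path, dedupe (src repeats one of them),
# and print the wget commands for review before running them
grep -o '/content/[^ ,"]*\.jpeg' file.html | sort -u |
xargs -I{} echo wget https://domain.tld{}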
CodePudding user response:
To get those images you'll need to actually parse that website. This can be done with the HTML parser xidel:
$ xidel -s "<url>" -e 'tokenize(//img/@data-srcset,"\s+\S+,\n?")[.]'
/content/images/size/w400/img.jpeg
/content/images/size/w750/img.jpeg
/content/images/size/w960/img.jpeg
Now just "follow" these urls (with -f / --follow) to get the content of these images.
Normally --download . would be enough, but since all 3 images are named 'img.jpeg', you have to rename them:
$ xidel -s "<url>" \
-f 'tokenize(//img/@data-srcset,"\s+\S+,\n?")[.]' \
--download 'extract($url,"w\d+")||".jpeg"'
This should download 'w400.jpeg', 'w750.jpeg' and 'w960.jpeg' in the current dir.
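The same renaming idea also works with plain wget if you feed it the paths from the tokenize step above. A minimal sketch, assuming the three paths are saved one per line in urls.txt and domain.tld is the real host:

# download each path and name the file after its width directory, e.g. w400.jpeg
while read -r path; do
  name=$(printf '%s' "$path" | grep -Eo 'w[0-9]+')
  wget -O "$name.jpeg" "https://domain.tld$path"
done < urls.txt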