XPath selector giving me two duplicate results for each article

I’m trying to grab the URL of the cover image for each article (10 articles on the first page) on this website: www.coindesk.com/tag/bitcoin/1/

but the following XPath

'//div[@]//img/@src'

gives me 20 results, with each URL appearing twice, when I only want 10. How can I avoid the duplicates?

Thank you in advance for your help!

CodePudding user response：

With XPath 2.0:

distinct-values(//div[@]//img/@src)

CodePudding user response：

In a shell (if you are stuck with XPath 1.0):

$ xidel -se '//div[@]//img/@src' \
   https://www.coindesk.com/tag/bitcoin/1/ |
    sort -u

If you use another language, the idea is to create an array and filter duplicates.
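
As a rough illustration of the "create an array and filter duplicates" idea, here is a minimal Python sketch. The URLs are placeholders and `unique_in_order` is a hypothetical helper name. In practice the list would come from whatever XPath library you use.

```python
def unique_in_order(values):
    """Drop duplicates while keeping the first occurrence of each value."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(values))


# Placeholder data standing in for the 20 results of the XPath query.
urls = [
    "https://example.com/a.jpg",
    "https://example.com/a.jpg",
    "https://example.com/b.jpg",
    "https://example.com/b.jpg",
]

print(unique_in_order(urls))
# → ['https://example.com/a.jpg', 'https://example.com/b.jpg']
```

Unlike `sort -u`, this keeps the articles in page order.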

CodePudding user response：

This happens because each article contains two //div[@] nodes. Look at a parent or ancestor node for a unique identifier instead. The following 4 queries should all return the same output:

//div[starts-with(@class,"display-desktop-block")]/div/div/a/picture/img/@src
//div[starts-with(@class,"display-desktop-block")]//@src

//div[@]/div/a/picture/img/@src
//div[@]//@src
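
To show why anchoring on a unique ancestor removes the duplicates, here is a small Python sketch using only the standard library. The markup is a simplified, hypothetical stand-in for the real page, and the `display-mobile-block` class is an assumption made for the example. ElementTree's XPath subset has no `starts-with()`, so the class test is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: every article renders its cover twice,
# once in a desktop wrapper and once in a (made-up) mobile wrapper.
SAMPLE = """
<main>
  <article>
    <div class="display-desktop-block x"><a><picture><img src="/a.jpg"/></picture></a></div>
    <div class="display-mobile-block"><a><picture><img src="/a.jpg"/></picture></a></div>
  </article>
  <article>
    <div class="display-desktop-block x"><a><picture><img src="/b.jpg"/></picture></a></div>
    <div class="display-mobile-block"><a><picture><img src="/b.jpg"/></picture></a></div>
  </article>
</main>
"""


def cover_urls(xml_text, class_prefix=None):
    """Collect img/@src under every div, optionally only divs whose class starts with class_prefix."""
    root = ET.fromstring(xml_text)
    urls = []
    for div in root.iter("div"):
        if class_prefix is None or div.get("class", "").startswith(class_prefix):
            urls.extend(img.get("src") for img in div.iter("img"))
    return urls


all_urls = cover_urls(SAMPLE)                               # both wrappers match
desktop_only = cover_urls(SAMPLE, "display-desktop-block")  # one wrapper per article

print(all_urls)      # → ['/a.jpg', '/a.jpg', '/b.jpg', '/b.jpg']
print(desktop_only)  # → ['/a.jpg', '/b.jpg']
```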

The four queries above return the URLs of the lowest-resolution image variant. If you want the highest resolution and best quality instead (with xidel):

$ xidel -s "http://www.coindesk.com/tag/bitcoin/1" -e '
  //div[starts-with(@class,"display-desktop-block")]/div/div/a/picture/(source[@type="image/webp"])[last()]/tokenize(tokenize(@srcSet,", ")[2])[1]
'
$ xidel -s "http://www.coindesk.com/tag/bitcoin/1" -e '
  //div[@]/div/a/picture/(source[@type="image/webp"])[last()]/tokenize(tokenize(@srcSet,", ")[2])[1]
'
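
For readers not using xidel, the nested `tokenize()` calls can be mirrored in a few lines of Python. This is only a sketch: the `srcset` string is a made-up example and `pick_srcset_url` is a hypothetical helper. Like the query above, it splits the candidates on `", "`, takes the second one, and keeps the part before the first whitespace.

```python
def pick_srcset_url(srcset, position=2):
    """Return the URL of the 1-based position-th candidate in a srcset string."""
    # Outer tokenize(@srcSet, ", "): one entry per image candidate.
    candidates = srcset.split(", ")
    # Inner tokenize(...)[1]: split "URL descriptor" on whitespace, keep the URL.
    return candidates[position - 1].split()[0]


# Made-up srcset value with a 1x and a 2x candidate.
example = "https://example.com/cover-small.webp 1x, https://example.com/cover-large.webp 2x"

print(pick_srcset_url(example))
# → https://example.com/cover-large.webp
```

This raises an `IndexError` if the attribute has fewer candidates than `position`, so a real script would want to check the length first.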