I’m trying to grab the URL of the cover image for each article (10 articles on the first page) on this website: www.coindesk.com/tag/bitcoin/1/
However, the following XPath
'//div[@]//img/@src'
gives me 20 results, with every URL appearing twice, when I only want 10. How can I avoid the duplicates?
Thank you in advance for your help!
CodePudding user response:
With XPath 2.0:
distinct-values(//div[@]//img/@src)
CodePudding user response:
In a shell (if you are stuck with XPath 1.0):
$ xidel -se '//div[@]//img/@src' \
https://www.coindesk.com/tag/bitcoin/1/ |
sort -u
If you use another language, the idea is to create an array and filter duplicates.
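As a rough illustration of that idea in Python: the sketch below assumes the attribute values have already been collected into a list by whatever XPath library you use. The helper name unique_in_order and the example.com URLs are made up for the example and are not taken from the site.

```python
def unique_in_order(urls):
    """Return urls with duplicates removed, keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


# Hypothetical sample: each cover URL shows up twice, as in the question.
scraped = [
    "https://example.com/cover-1.jpg",
    "https://example.com/cover-1.jpg",
    "https://example.com/cover-2.jpg",
    "https://example.com/cover-2.jpg",
]

covers = unique_in_order(scraped)
print(covers)
```

Unlike sort -u, this keeps the URLs in page order, which matters if you want to pair each image with its article later.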
CodePudding user response:
It's because each article has 2 //div[@] nodes. Look at the parent or ancestor node for a unique identifier. The following 4 queries should all return the same output:
//div[starts-with(@class,"display-desktop-block")]/div/div/a/picture/img/@src
//div[starts-with(@class,"display-desktop-block")]//@src
//div[@]/div/a/picture/img/@src
//div[@]//@src
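To see why scoping the query to one container removes the duplicates, here is a small Python sketch using only the standard library. The markup is a made-up stand-in (the class names "desktop" and "mobile" and the example.com URLs are assumptions, not the real page's), and ElementTree only supports a limited XPath subset, so it uses an exact class match rather than starts-with().

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: every article carries the same image in two
# sibling containers, one per layout.
FRAGMENT = """
<main>
  <article>
    <div class="desktop"><a><img src="https://example.com/a.jpg"/></a></div>
    <div class="mobile"><a><img src="https://example.com/a.jpg"/></a></div>
  </article>
  <article>
    <div class="desktop"><a><img src="https://example.com/b.jpg"/></a></div>
    <div class="mobile"><a><img src="https://example.com/b.jpg"/></a></div>
  </article>
</main>
"""

root = ET.fromstring(FRAGMENT)

# Unscoped: matches the image in both containers, so each URL comes twice.
all_srcs = [img.get("src") for img in root.findall(".//img")]

# Scoped to one container: one URL per article.
desktop_srcs = [
    img.get("src") for img in root.findall(".//div[@class='desktop']//img")
]

print(len(all_srcs), len(desktop_srcs))
```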
The four queries above return the URLs of the lowest-resolution image variant. If you want the highest resolution and best quality instead (with xidel):
$ xidel -s "http://www.coindesk.com/tag/bitcoin/1" -e '
//div[starts-with(@class,"display-desktop-block")]/div/div/a/picture/(source[@type="image/webp"])[last()]/tokenize(tokenize(@srcSet,", ")[2])[1]
'
$ xidel -s "http://www.coindesk.com/tag/bitcoin/1" -e '
//div[@]/div/a/picture/(source[@type="image/webp"])[last()]/tokenize(tokenize(@srcSet,", ")[2])[1]
'
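For readers not using xidel, the nested tokenize() calls do the following: split the srcSet value on ", " into candidates, take the second candidate, then split that on whitespace and keep the first piece, which is the URL. A minimal Python sketch of the same steps follows; the function name second_srcset_url and the sample value are hypothetical, and it assumes the candidates are separated by a comma plus a space, as the xidel query does.

```python
def second_srcset_url(srcset):
    """Return the URL of the second srcset candidate, or None if absent."""
    # Mirrors tokenize(@srcSet, ", ")[2]: XPath positions are 1-based,
    # so the second candidate is index 1 here.
    candidates = srcset.split(", ")
    if len(candidates) < 2:
        return None
    # Mirrors tokenize(...)[1]: split on whitespace, keep the URL part.
    parts = candidates[1].split()
    return parts[0] if parts else None


# Hypothetical srcset value with a 1x and a 2x candidate.
sample = (
    "https://example.com/cover.webp?w=320 1x, "
    "https://example.com/cover.webp?w=640 2x"
)
print(second_srcset_url(sample))
```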