I have a regex (?i)(?<=srcset=\")([^\"]+)
that matches the content of a srcset attribute in the HTML source of a page. It's working. Now I would like to extract all the URLs from the capture group with something like http[^ ]+
- but I can only get one, no matter what I try.
Here is an example: (?i)(?<=srcset=\")((http[^\s]+).*?)+(?=\")
The catch is: it's important that both operations are merged into one expression.
Any ideas?
Source code:
srcset="https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png 793w, https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png 300w, https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-768x514.png 768w, https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-700x469.png 700w"
CodePudding user response:
With xidel, a proper HTML parser, using XPath:
xidel -e 'tokenize(/img/@srcset, "\s+\w+,?\s*")' -s file.html
Output
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-768x514.png
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-700x469.png
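If xidel is not an option, the same tokenize idea can be sketched in Python. This is only an illustration: it assumes the srcset value has already been pulled out of the HTML, that every URL is followed by a simple descriptor such as 793w, and the helper name split_srcset is made up for this example.

```python
import re

# Sample attribute value, shortened from the question's source.
srcset = (
    "https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png 793w, "
    "https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png 300w"
)

def split_srcset(value):
    """Split a srcset value on ' <descriptor>,' separators (hypothetical helper)."""
    # Same separator as the XPath tokenize() call: whitespace, a descriptor
    # made of word characters, an optional comma, optional trailing whitespace.
    parts = re.split(r"\s+\w+,?\s*", value)
    # The last descriptor leaves an empty trailing string; drop empty pieces.
    return [part for part in parts if part]

for url in split_srcset(srcset):
    print(url)
```

The separator only handles descriptors made of word characters, so a density descriptor like 1.5x would need a wider pattern.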
CodePudding user response:
You can use the \G anchor to chain matches, if your regex flavor supports it:
(?i)(?:\G(?!^)|srcset=")[^"]*?\b(https?:[^"\s,]+)
(?:\G(?!^)|srcset=")
either matches srcset=" or continues at the end of the previous match
[^"]*?
lazily matches any number of non-double-quote characters (the text between the links)
\b(https?:[^"\s,]+)
captures each link into the first group, starting at a word boundary
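Python's built-in re module has no \G, so the chaining cannot be written there as one expression. The sketch below only emulates it to show how the matches chain: Pattern.match(text, pos) is anchored at pos, which plays the role of \G. The names START, STEP and srcset_urls are invented for this example.

```python
import re

# Entry point of the chain: the opening of the attribute ((?i) -> IGNORECASE).
START = re.compile(r'srcset="', re.IGNORECASE)
# One chained step: lazily skip non-quote characters, then capture one URL.
STEP = re.compile(r'[^"]*?\b(https?:[^"\s,]+)', re.IGNORECASE)

def srcset_urls(html):
    """Collect every URL inside each srcset="..." attribute (hypothetical helper)."""
    urls = []
    for start in START.finditer(html):
        pos = start.end()
        while True:
            # match() must succeed exactly at pos, like \G at the previous match end.
            step = STEP.match(html, pos)
            if step is None:
                break  # [^"]*? cannot cross the closing quote, so the chain stops
            urls.append(step.group(1))
            pos = step.end()
    return urls

sample = '<img srcset="https://www.thesite.nl/a.png 793w, https://www.thesite.nl/a-300x201.png 300w">'
for url in srcset_urls(sample):
    print(url)
```

In a flavor that does support \G (PCRE, for example), the single expression above does the same work without the explicit loop.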