Can I do a regex on a capture group in the same regex?

I have a regex (?i)(?<=srcset=\")([^\"]+) that matches the content of a srcset attribute in the HTML source of a page. It's working. Now I would like to extract all the URLs from the capture group with something like http[^ ]+, but I can only get one, no matter what I try.

Here is an example: (?i)(?<=srcset=\")((http[^\s]+).*?)+(?=\")

The catch is: it's important that both operations are merged into one expression.

Any ideas?

Source code:

srcset="https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png 793w, https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png 300w, https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-768x514.png 768w, https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-700x469.png 700w"

CodePudding user response：

With xidel, a proper HTML parser, using XPath:

xidel -e 'tokenize(/img/@srcset, "\s+\w+,?\s*")' -s file.html

Output

https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-768x514.png
https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-700x469.png
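
The same splitting idea can be sketched outside xidel. The Java snippet below is only an illustration, not part of the answer: it assumes the srcset value has already been pulled out by an HTML parser, that every URL is followed by a plain width descriptor such as 793w, and the class and method names (SrcsetSplit, urls) are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

final class SrcsetSplit {
    // Hypothetical helper: splits an already-extracted srcset value on the
    // width descriptors ("793w", "300w", ...) that follow each URL.
    static List<String> urls(String srcsetValue) {
        // Same separator as in the tokenize() call: whitespace, a descriptor
        // word, an optional comma, and any whitespace after it.
        // String.split drops the trailing empty string left after the last descriptor.
        return Arrays.asList(srcsetValue.trim().split("\\s+\\w+,?\\s*"));
    }

    public static void main(String[] args) {
        String srcset = "https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png 793w, "
                + "https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png 300w";
        urls(srcset).forEach(System.out::println);
    }
}
```

Descriptors containing a dot, such as 1.5x, would not be consumed cleanly by \w+, so this sketch only covers input shaped like the sample above.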

CodePudding user response：

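If your regex flavor supports it (PCRE, Perl, Java and .NET do; JavaScript and Python's built-in re module do not), you can use the \G anchor to chain matches: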

(?i)(?:\G(?!^)|srcset=")[^"]*?\b(https?:[^"\s,]+)

See this demo at regex101

(?:\G(?!^)|srcset=") either matches srcset=" or continues where the previous match ended
[^"]*? lazily matches any number of characters that are not double quotes (the text between the links)
\b(https?:[^"\s,]+) captures each link into the first group, starting at a word boundary
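
As a rough illustration of how the chained matches are consumed, here is a minimal Java sketch (Java's java.util.regex supports \G). The pattern is the one from this answer; the class name SrcsetUrls, the extract helper and the shortened sample markup are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SrcsetUrls {
    // The pattern from the answer, escaped for a Java string literal.
    // \G(?!^) resumes at the end of the previous match, never at the start of the input.
    private static final Pattern SRCSET_URL = Pattern.compile(
            "(?i)(?:\\G(?!^)|srcset=\")[^\"]*?\\b(https?:[^\"\\s,]+)");

    // Hypothetical helper: collects group 1 of every successive match.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = SRCSET_URL.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // one URL per match
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img srcset=\"https://www.thesite.nl/wp-content/uploads/2016/02/logo-home.png 793w, "
                + "https://www.thesite.nl/wp-content/uploads/2016/02/logo-home-300x201.png 300w\">";
        extract(html).forEach(System.out::println);
    }
}
```

Each call to find() yields one URL, so the single expression still needs a "find all" loop (or the global flag in other tools) to return every link. The chain stops at the closing double quote because [^"]*? cannot cross it.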