How to use Regex for replacing all URLs that are inside srcset of an img tag?

Time:05-22

I'm trying to prefix all URLs in the srcset attribute of an img tag.
*I'll use PHP (or maybe Python), but for now I mostly need to capture them in a group first.

For example, I want to turn this line:

<img attr1="abc" src="//img1.jpg" srcset="//img2.png 480w, //img3.png 800w, //img4.png" attr2="cdf">

Into this line:

<img attr1="abc" src="//img1.jpg" srcset="PREFIX//img2.png 480w, PREFIX//img3.png 800w, PREFIX//img4.png" attr2="cdf">

*I've been playing with it for some time, and I either get only the first URL, get all of them as one string, or get the matches split across different groups...

** UPDATE (just to clarify) **
I need a generic solution, not something that will work only on my specific example.
Also, I want each URL captured separately in a match group. I don't want to extract the contents of the srcset and then keep processing them.
It needs to work with something like: preg_replace($regex, 'PREFIX$1', $html); or re.sub(regex, r'PREFIX\1', html)

-- Any idea would be appreciated.

CodePudding user response:

Well, assuming the form is always srcset=" and that commas are only found in the srcset attribute, you could use a 'good enough' solution like:

$html = '<img attr1="abc" src="//img1.jpg" srcset="//img2.png 480w, //img3.png 800w, //img4.png" attr2="cdf">';

$regex = '/((?:(?<=\bsrcset=)"|,)\s*)([^\s",]+)/';

echo preg_replace($regex, '$1PREFIX$2', $html);

# <img attr1="abc" src="//img1.jpg" srcset="PREFIX//img2.png 480w, PREFIX//img3.png 800w, PREFIX//img4.png" attr2="cdf">
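Since the question also mentions Python, here is a minimal sketch of the same pattern with re.sub. It relies on the lookbehind being fixed-width, which Python's re module requires. The variable names are illustrative, and the same assumptions apply: double-quoted srcset=" and no commas elsewhere in the HTML.

```python
import re

html = '<img attr1="abc" src="//img1.jpg" srcset="//img2.png 480w, //img3.png 800w, //img4.png" attr2="cdf">'

# Group 1: the opening quote of srcset (or a comma) plus any whitespace after it.
# Group 2: the URL that follows, up to the next whitespace, quote, or comma.
pattern = re.compile(r'((?:(?<=\bsrcset=)"|,)\s*)([^\s",]+)')

# Re-emit group 1 unchanged and put the prefix in front of the URL.
result = pattern.sub(r'\1PREFIX\2', html)
print(result)
```

Note that any comma elsewhere in the document that is followed by text would also get the prefix, which is why this is only 'good enough'.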

CodePudding user response:

Doing it in Python could look like this:

import re
html = '<img attr1="abc" src="//img1.jpg" srcset="//img2.png 480w, //img3.png 800w, //img4.png" attr2="cdf">'
srcset = re.findall(r'srcset="([^"]+)"', html)[0]
prefixed = srcset.replace("//", "PREFIX//")
html = html.replace(srcset, prefixed)
print(html)

Outputs:

<img attr1="abc" src="//img1.jpg" srcset="PREFIX//img2.png 480w, PREFIX//img3.png 800w, PREFIX//img4.png" attr2="cdf">

Explanation:

The regex searches for a string starting with srcset=" and captures all following characters up to the next ". The result is stored in srcset. Within srcset you can insert your PREFIX and then substitute the result back into your line of HTML.

I really love unjustified downvotes :joy:
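A possible variation on this idea, not part of the answer above: passing a callback to re.sub keeps everything in one call and handles every srcset attribute in the document, not just the first. It also avoids two side effects of plain str.replace, which would touch a "//" inside an absolute URL such as https://... and any other occurrence of the same string in the HTML. The names prefix_srcsets, SRCSET_ATTR and CANDIDATE_URL are hypothetical, and the sketch assumes double-quoted attributes and URLs that contain no commas.

```python
import re

# Matches a double-quoted srcset attribute: opening part, value, closing quote.
SRCSET_ATTR = re.compile(r'(\bsrcset=")([^"]*)(")')
# Inside the value: start of string or a comma, optional whitespace, then the URL.
CANDIDATE_URL = re.compile(r'(^|,)(\s*)([^\s,]+)')


def prefix_srcsets(html, prefix="PREFIX"):
    """Prefix every URL inside every srcset attribute found in html."""

    def rewrite_attr(match):
        opening, value, closing = match.groups()
        # A lambda is used so the prefix is inserted literally,
        # even if it contains backslashes.
        new_value = CANDIDATE_URL.sub(
            lambda m: m.group(1) + m.group(2) + prefix + m.group(3), value
        )
        return opening + new_value + closing

    return SRCSET_ATTR.sub(rewrite_attr, html)


html = '<img attr1="abc" src="//img1.jpg" srcset="//img2.png 480w, //img3.png 800w, //img4.png" attr2="cdf">'
print(prefix_srcsets(html))
```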

CodePudding user response:

Since you are dealing with HTML, it's always a good idea to work with a proper HTML parser. You won't need regex, though some string manipulation is required:

With Python:

import lxml.html as lh

pics = """[your html above]"""
doc = lh.fromstring(pics)

# use XPath to locate the target element
imgs = doc.xpath('//img[@srcset]')[0]
srcs = imgs.xpath('@srcset')[0]

# prepare the new srcset attribute value
futuresrcs = []
for targ in srcs.split(','):
    newstring = targ.replace('/', 'PREFIX/', 1)
    futuresrcs.append(newstring)

# modify the attribute value
imgs.attrib['srcset'] = ",".join(futuresrcs)

print(lh.tostring(doc).decode())

It's a little more complex in PHP, but not by much:

$html = '[your html above]
';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$imgs = $xpath->query("//img[@srcset]")[0];
$srcs = $xpath->query(".//@srcset",$imgs)[0]->value;

$targs = explode(",",$srcs);
$futuresrcs = array();

foreach ($targs as $targ) {
    $pos = strpos($targ, "/");
    $newstring = substr_replace($targ, 'PREFIX/', $pos, 1);
    array_push($futuresrcs,$newstring);
}
$newsrcs = implode(",", $futuresrcs);
$imgs->setAttribute('srcset', $newsrcs);
echo $doc->saveHTML();

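In either case, the img tag should match your expected output. Note that PHP's DOMDocument wraps the fragment in a doctype and html/body tags unless you pass LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD as the second argument to loadHTML.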
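The string-manipulation step can also be kept independent of the parser. The following is a minimal sketch with a hypothetical helper, prefix_srcset_value, which takes one srcset value and prefixes the first token of each comma-separated candidate. Unlike replacing the first "/", it does not assume the URL starts with a slash. It does assume that URLs contain no commas, and it normalizes the separator to ", ".

```python
def prefix_srcset_value(value, prefix="PREFIX"):
    """Return a srcset value with prefix added in front of each URL."""
    candidates = []
    for candidate in value.split(","):
        parts = candidate.split()  # [url] or [url, descriptor]
        if not parts:
            continue  # skip empty entries, e.g. from a trailing comma
        parts[0] = prefix + parts[0]
        candidates.append(" ".join(parts))
    return ", ".join(candidates)


print(prefix_srcset_value("//img2.png 480w, //img3.png 800w, //img4.png"))
```

With lxml, one way to use it would be to loop over doc.xpath('//img[@srcset]') and call img.set('srcset', prefix_srcset_value(img.get('srcset'))) on each element, which covers every image rather than only the first.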
