I have a text variable that contains multiple images with relative or absolute paths. I need to check each src attribute: if it starts with http or https, ignore it, but if it starts with / or something like abc/, prepend a base URL.
I tried the following:
<?php
$html = <<<HTML
<img src="docs/relative/url/img.jpg" />
<img src="/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
HTML;
$base = 'https://example.com/';
$pattern = "/<img src=\"[^http|https]([^\"]*)\"/";
$replace = "<img src=\"" . $base . "\${1}\"";
echo $text = preg_replace($pattern, $replace, $html);
My output is:
<img src="https://example.com/ocs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Issue: the result is almost entirely correct, but when the src attribute starts with something like docs/, its first letter is cut off (see the first img src in the output).
The output I need is:
<img src="https://example.com/docs/relative/url/img.jpg" /><!--check this and compare with current result, you will get the difference -->
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Could anyone help me fix it?
CodePudding user response:
The following pattern seeks src attributes that do not start with http or https. For relative paths that begin with a forward slash, the leading slash is removed before the $base string is prepended to the src value.
Code:
$base = 'https://example.com/';
echo preg_replace('~ src="(?!http)\K/?~', $base, $html);
Output:
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://example.com/docs/relative/url/img.jpg" />
<img src="https://docs/relative/url/img.jpg" />
<img src="http://docs/relative/url/img.jpg" />
Breakdown:
~          #starting pattern delimiter
 src="     #match space, s, r, c, =, then "
(?!http)   #only continue matching if the next characters are not http (this also rules out https)
\K         #forget any previously matched characters so they are not destroyed by the replacement string
/?         #optionally match a forward slash
~          #ending pattern delimiter
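To make the role of \K concrete, here is a small sketch that runs the same pattern with and without it. The sample tag and the variable names $withoutK and $withK are illustrative assumptions, not part of the question.

```php
<?php
$base = 'https://example.com/';
$tag  = '<img src="/docs/img.jpg" />';

// Without \K the full match (space, src=", and the slash) is replaced,
// so the attribute name itself is destroyed.
$withoutK = preg_replace('~ src="(?!http)/?~', $base, $tag);

// With \K everything matched before it is dropped from the reported match,
// so only the optional leading slash is replaced by $base.
$withK = preg_replace('~ src="(?!http)\K/?~', $base, $tag);

echo $withoutK, "\n"; // <imghttps://example.com/docs/img.jpg" />
echo $withK, "\n";    // <img src="https://example.com/docs/img.jpg" />
```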
As for your pattern, /<img src=\"[^http|https]([^\"]*)\"/:
- [^http|https] actually means "match a single character that is not from this list: |, h, t, p, and s". It could be simplified to [^|hpst] because the order of the listed characters in a negated character class is irrelevant and duplicating characters is meaningless. So you see, [^...] is not how you say "a string starts with something or somethingelse". Because that single character is matched outside of the capture group, it is not part of ${1}; this is why the d of docs/ went missing, and why the /docs/ case only looked correct (the consumed / was replaced by the trailing slash of $base).
- Capturing all remaining characters in a substring until the next double quote with the intent to use it again in the replacement is unnecessary. This is why I use \K to pinpoint where $base should be injected instead of ([^\"]*).
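The two points above can be checked with a short sketch. The first call shows the negated character class consuming one character. The second is one possible way to keep the capture-group style of the question by swapping the character class for a negative lookahead. Treating https?:// as the marker of an absolute URL is my assumption, and $fixed is an illustrative name.

```php
<?php
$base = 'https://example.com/';
$html = '<img src="docs/relative/url/img.jpg" />' . "\n"
      . '<img src="https://docs/relative/url/img.jpg" />';

// The negated class consumes the "d", so only "ocs/..." lands in group 1.
preg_match('/<img src="[^http|https]([^"]*)"/', $html, $m);
echo $m[1], "\n"; // ocs/relative/url/img.jpg

// A lookahead asserts without consuming, so group 1 keeps the whole path.
// The optional slash is matched outside the group and therefore dropped.
$fixed = preg_replace(
    '~<img src="(?!https?://)/?([^"]*)"~',
    '<img src="' . $base . '$1"',
    $html
);
echo $fixed, "\n";
```

With this input, only the first tag is rewritten; the https:// tag is skipped because the lookahead fails at that position.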
Furthermore, I always recommend the stability of a DOM parser when dealing with a valid HTML document. You can use DOMDocument with XPath to target the qualifying elements and modify the src attributes without regex.
Code:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//img[not(starts-with(@src, 'http'))]") as $node) {
    $node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
}
echo $dom->saveHTML();
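If you want to inspect the rewritten values rather than print the whole document, a sketch like the following may help. The helper name rewriteImageSources and the shortened sample markup are hypothetical. It calls loadHTML() without flags, so libxml adds html and body wrappers, which does not matter here because only the attributes are read back.

```php
<?php
// Hypothetical helper (not from the answer): rewrites non-http src values
// and returns every img src found in the markup afterwards.
function rewriteImageSources(string $html, string $base): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html); // no flags: wrapper elements are added by libxml
    $xpath = new DOMXPath($dom);

    // Same XPath filter and ltrim() logic as in the snippet above.
    foreach ($xpath->query("//img[not(starts-with(@src, 'http'))]") as $node) {
        $node->setAttribute('src', $base . ltrim($node->getAttribute('src'), '/'));
    }

    $sources = [];
    foreach ($xpath->query('//img') as $node) {
        $sources[] = $node->getAttribute('src');
    }
    return $sources;
}

$sources = rewriteImageSources(
    '<img src="docs/a.jpg" /><img src="/docs/b.jpg" /><img src="https://docs/c.jpg" />',
    'https://example.com/'
);
print_r($sources);
```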
A related answer: https://stackoverflow.com/a/48837947/2943403