Keep Only subdirectory from href and src (ROOT html links )

Time:11-03

Hello, I have code that copies the HTML from an external URL and echoes it on my page. Some of that HTML contains links and/or image src attributes. I need help truncating them (from absolute URLs to relative URLs inside $data).

For example, inside the HTML there is an href:

<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >

or a src:

<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">

I would like to keep only the subdirectory:

/products/score-vs-ibd/

/Filters/MinDUp1.gif

Maybe with preg_replace, but I'm not familiar with regular expressions.

This is my original code, which works very well, but now I'm stuck on truncating the links.

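<?php
$post_tags = get_the_tags();
$tag = ''; // stays empty when the post has no tags
if ( $post_tags ) {
    $tag = $post_tags[0]->name; // use the first tag as the symbol
}
$html = file_get_contents('https://www.trade-ideas.com/ticky/ticky.html?symbol=' . $tag);

// Keep everything from the first "<div " up to the "<!-- /span -->" marker
$start = strpos($html, '<div ');
$end = strpos($html, '<!-- /span -->', $start);
$data = substr($html, $start, $end - $start);
echo $data;
?>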

CodePudding user response:

Here is the code:

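function getUrlPath($url) {
   // Skip the optional scheme and the host, then capture everything after the first "/" or "?"
   $re = '/(?:https?:\/\/)?(?:[^?\/\s]+[?\/])(.*)/';
   if (preg_match($re, $url, $matches)) {
      return $matches[1];
   }
   return ''; // no path found
}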

Example: getUrlPath("http://myassets.com:80/files/images/image.gif") returns files/images/image.gif
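
As an alternative to a hand-written regex, PHP's built-in parse_url() can split a URL into its parts. The sketch below is only an illustration: the helper name getUrlPathAndQuery is made up for this example and is not part of the answer above. Unlike the regex, it keeps the leading slash, which matches the format asked for in the question.

```php
<?php
// Hypothetical helper (the name is an assumption made for this example).
// Returns the path plus the query string, if any, of an absolute URL.
function getUrlPathAndQuery(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return ''; // seriously malformed URL
    }
    // A URL without a path (e.g. "https://example.com") maps to the site root.
    $result = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    return $result;
}

echo getUrlPathAndQuery('http://myassets.com:80/files/images/image.gif'), "\n";
// prints /files/images/image.gif
echo getUrlPathAndQuery('https://www.trade-ideas.com/products/score-vs-ibd/'), "\n";
// prints /products/score-vs-ibd/
```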

CodePudding user response:

You can locate all the URLs in the html string with a regex using preg_match_all().
The regex:

'/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i'

will capture both the entire URL and the path/query string for every occurrence of ="http://domain/path" or ='https://domain/path?query' (http/https, single or double quotes, with/without query string).
Then you can just use str_replace() to update the html string.

<?php
$html = '<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >
<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">
<img src=\'https://static.trade-ideas.com/Filters/MinDUp1.gif?param=value\'>';

$pattern = '/=[\'"](https?:\/\/.*?(\/.*))[\'"]/i';
$urls = [];
preg_match_all($pattern, $html, $urls);
//var_dump($urls);
foreach($urls[1] as $i => $uri){
    $html = str_replace($uri, $urls[2][$i], $html);
}
echo $html;


Note, this will change all absolute URLs enclosed in quotes immediately following an =.
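
If rewriting every quoted absolute URL is too broad, one possible variation is to touch only href and src attributes with preg_replace_callback(). This is a rough sketch, not a full HTML parser: the function name stripHostFromLinks and the pattern are assumptions made for illustration. The pattern stops at the first closing quote, so several URLs on the same line are handled one by one, and values in other attributes are left alone.

```php
<?php
// Hypothetical helper (name and pattern are assumptions for this example).
// Rewrites absolute http/https URLs in href and src attributes to root-relative ones.
function stripHostFromLinks(string $html): string
{
    // Group 1: attribute name, group 2: opening quote,
    // group 3: everything after the host (path, query string, fragment).
    $pattern = '~(?<![\w-])(href|src)\s*=\s*(["\'])https?://[^/"\'?#\s]+([^"\']*)\2~i';

    $result = preg_replace_callback($pattern, function (array $m): string {
        $rest = $m[3];
        // A URL with no path (e.g. https://example.com) becomes the site root.
        if ($rest === '' || $rest[0] !== '/') {
            $rest = '/' . $rest;
        }
        return $m[1] . '=' . $m[2] . $rest . $m[2];
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

// Sample input; in the question's code this would be $data.
$data = '<a href="https://www.trade-ideas.com/products/score-vs-ibd/" >'
      . '<img src="http://static.trade-ideas.com/Filters/MinDUp1.gif">';
echo stripHostFromLinks($data), "\n";
// prints <a href="/products/score-vs-ibd/" ><img src="/Filters/MinDUp1.gif">
```

With the code from the question, the call would be echo stripHostFromLinks($data); in place of echo $data;.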
