Hi, I'm trying to scrape the href link from an a tag using regex, but I'm unable to retrieve the link. Can someone help me achieve this? Here is the link I am trying to extract from the HTML page: /u/0/uc?export=download&confirm=EY_S&id=fileid
Here is my PHP function:
<?php
function dwnload($url)
{
    $scriptx = "";
    $internalErrors = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    @$dom->loadHTML(curl($url));
    foreach ($dom->getElementsByTagName('a') as $k => $js) {
        $scriptx .= $js->nodeValue;
    }
    preg_match_all('#\bhttps?://[^,\s()<>]+(?:\([\w\d]+\)|([^,[:punct:]\s]|/))#', $scriptx, $match);
    $vlink = "";
    foreach ($match[0] as $c) {
        if (strpos($c, 'export=download') !== false) {
            $vlink = $c;
        }
    }
    return $vlink;
}?>
Thanks
CodePudding user response:
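You're concatenating the link texts (nodeValue). That does not make sense, because the link text does not contain the href value, so the regex has nothing to match. If you want to extract links, DOMDocument::getElementsByTagName() already does the job. You just need to filter the results.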
Let's consider a small HTML fragment:
$html = <<<'HTML'
<a href="/u/0/uc?export=download&confirm=EY_S&id=fileid">SUCCESS</a>
<a href="/another/link">FAILURE</a>
HTML;
Now iterate over the a elements and filter them by their href attribute.
$document = new DOMDocument();
$document->loadHTML($html);
foreach ($document->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, 'export=download') !== false) {
        var_dump([$href, $a->textContent]);
    }
}
Output:
array(2) {
[0]=>
string(46) "/u/0/uc?export=download&confirm=EY_S&id=fileid"
[1]=>
string(7) "SUCCESS"
}
If a plain substring match is sufficient, you can use an XPath expression instead:
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
foreach ($xpath->evaluate('//a[contains(@href, "export=download")]') as $a) {
    var_dump([$a->getAttribute('href'), $a->textContent]);
}
Or combine the XPath expression with a more specific regular expression, so that export=download only matches as a complete query parameter:
$pattern = '((?:\\?|&)export=download(?:&|$))';
foreach ($xpath->evaluate('//a[contains(@href, "export=download")]') as $a) {
    $href = $a->getAttribute('href');
    if (preg_match($pattern, $href)) {
        var_dump([$href, $a->textContent]);
    }
}
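Putting the pieces together, here is a minimal sketch of how the function from the question could be reworked. The name findDownloadLink() is hypothetical, and it assumes you pass in the HTML as a string (for example whatever your own curl() helper returns). Unlike the loop in the question, which keeps the last match, this returns the first matching href, or null if there is none.

```php
<?php
// Hypothetical helper, not part of the original post: returns the first href
// that carries export=download as a query parameter, or null if none is found.
function findDownloadLink(string $html): ?string
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return null;
    }

    // Collect parser warnings about sloppy real-world HTML instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    // Same pattern as above: export=download must follow ? or & and end the parameter.
    $pattern = '((?:\\?|&)export=download(?:&|$))';

    foreach ($xpath->evaluate('//a[contains(@href, "export=download")]') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match($pattern, $href)) {
            return $href;
        }
    }

    return null;
}

$html = <<<'HTML'
<a href="/another/link">FAILURE</a>
<a href="/u/0/uc?export=download&confirm=EY_S&id=fileid">SUCCESS</a>
HTML;

var_dump(findDownloadLink($html));
```

Note that the returned href is relative, exactly as it appears in the page, so you would still need to prepend the host before requesting it.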