I want to scrape all the pages of a website and get each page's meta description tag, like:
<meta name="description" content="I want to get this description of this meta tag" />
Similarly, for all the other pages I want to get their individual meta descriptions.
Here is my code:
add_action('woocommerce_before_single_product', 'my_function_get_description');
function my_function_get_description($url) {
    $the_html = file_get_contents('https://tipodense.dk/');
    print_r($the_html);
}
This print_r($the_html) gives me the whole page's HTML; I don't know how to get the meta description of each page.
Kindly guide me, thanks.
CodePudding user response:
The better way to parse an HTML file is to use DOMDocument
and, in many cases, to combine it with DOMXPath
to run queries against the DOM and find the elements of interest.
For instance, in your case, to extract the meta description you could do:
$url = 'https://tipodense.dk/';

# create the DOMDocument and load the url
libxml_use_internal_errors( true );
$dom = new DOMDocument;
$dom->validateOnParse = false;
$dom->strictErrorChecking = false;
$dom->recover = true;
$dom->loadHTMLFile( $url );
libxml_clear_errors();

# load XPath and query for the meta description
$xp = new DOMXPath( $dom );
$expr = '//meta[@name="description"]';
$col = $xp->query( $expr );

if( $col && $col->length > 0 ){
    foreach( $col as $node ){
        echo $node->getAttribute('content');
    }
}
Which yields:
Har du brug for at vide hvad der sker i Odense? Vores fokuspunkter er især events, mad, musik, kultur og nyheder. Hvis du vil vide mere så læs med på sitet.
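If you want to verify the XPath logic without hitting the live site, you can load an HTML string directly with loadHTML instead of loadHTMLFile. This is a minimal, self-contained sketch; the markup here is a made-up sample, not the real page:

```php
<?php
// Made-up HTML sample standing in for a fetched page
$html = '<html><head>'
      . '<meta name="description" content="Sample description" />'
      . '</head><body></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);   // parse a string instead of a URL
libxml_clear_errors();

$xp  = new DOMXPath($dom);
$col = $xp->query('//meta[@name="description"]');

// Take the first match, or null if the page has no description
$description = ($col && $col->length > 0)
    ? $col->item(0)->getAttribute('content')
    : null;

echo $description; // Sample description
```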
Using the sitemap (or part of it) you could do something like this:
$sitemap = 'https://tipodense.dk/sitemap-pages.xml';
$urls = array();

# create the DOMDocument
libxml_use_internal_errors( true );
$dom = new DOMDocument;
$dom->validateOnParse = false;
$dom->strictErrorChecking = false;
$dom->recover = true;

# read the sitemap & store the urls
$dom->load( $sitemap );
libxml_clear_errors();
$col = $dom->getElementsByTagName('loc');
foreach( $col as $node ) $urls[] = $node->nodeValue;

# visit each page and print its meta description
foreach( $urls as $url ){
    $dom->loadHTMLFile( $url );
    libxml_clear_errors();
    $xp = new DOMXPath( $dom );
    $expr = '//meta[@name="description"]';
    $col = $xp->query( $expr );
    if( $col && $col->length > 0 ){
        foreach( $col as $node ){
            printf('<div>%s: %s</div>', $url, $node->getAttribute('content') );
        }
    }
}
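The per-page extraction can also be factored into a small helper so it is easy to test in isolation. The function name below is my own invention, not part of any API; in the crawl loop you would feed it the fetched HTML, but here it is exercised with made-up markup instead of a live request:

```php
<?php
// Hypothetical helper: extract the meta description from an
// HTML string, or return null if the page has none.
function get_meta_description(string $html): ?string {
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->recover = true;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $col = (new DOMXPath($dom))->query('//meta[@name="description"]');
    return ($col && $col->length > 0)
        ? $col->item(0)->getAttribute('content')
        : null;
}

// Exercised with made-up markup rather than a live fetch
var_dump(get_meta_description('<head><meta name="description" content="Hello" /></head>')); // string(5) "Hello"
var_dump(get_meta_description('<head><title>No meta here</title></head>'));                 // NULL
```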
CodePudding user response:
You could also look at preg_match and regular expressions. Here it's quite simple:
function my_function_get_description($url) {
    $the_html = file_get_contents('https://tipodense.dk/');
    preg_match('/<meta name="description" content="([^"]+)"/', $the_html, $matches);
    print_r($matches);
}
https://regex101.com/r/JMcaUh/1
The description is captured by the capturing group () and saved in $matches[1]
EDIT: DOMDocument is a great solution too, but assuming you only want the description, using a regex looks easier to me!
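A runnable sketch of the regex approach against a made-up snippet; note the pattern needs delimiters, and with preg_match the capture group lands in $matches[1]:

```php
<?php
// Made-up HTML snippet standing in for the fetched page
$the_html = '<head><meta name="description" content="A short description" /></head>';

// Delimiters (/.../) are required; [^"]+ also tolerates punctuation
// that \w and \s alone would not match.
preg_match('/<meta name="description" content="([^"]+)"/', $the_html, $matches);

echo $matches[1]; // A short description
```

Bear in mind a regex like this assumes the exact attribute order and double quoting; the DOMDocument approach is more robust against variations in the markup.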