Grabbing specific elements inside DIV from external page

Time:09-07

I need to scrape the following elements from each of these divs (the page contains several of them), but I have no clue how to do it, so I need help before I pull my hair out.

1 - The link and image inside the div: <a href="https://...this_link" > (just need the link)

<img width="300" height="300" src="https://...this_image_url... (just need this image URL)

2 - The title inside the h3 tag, as follows:

<h3 ><a href="https://...linkhere">The title goes here (just the title)

3 - Last but not least, I need to grab the price inside this:

<span ><span ><bdi><span >€</span>20,00</bdi></span></span> (just the "€20,00")

Here's the full HTML:

<div  data-loop="1">

<div >
    <a href="https://...linkhere" >
        <img width="300" height="300" src="https://image-goes-here.jpg" >    </a>
    
    <div >

        <h3 ><a href="https://...linkhere">The title goes here</a></h3>        
                
        
    <span ><span ><bdi><span >€</span>20,00</bdi></span></span>

        <div >
            <a href="https://...linkhere" data-quantity="1" ><span>Options</span></a></div> 
    </div>

    <div >
                            <div >
                <a href="https://...linkhere" data-added-text="Compare Products">Buy</a>
            </div>
    <div >
                <a href="https://...linkhere" >quick view</a>
            </div>
                            <div >
                <a  href="https://linkhere/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
            </div>
            </div>
                <div >
                <div ><a href="#" rel="nofollow noopener">Close</a></div>
                <div >
                </div>
            </div>
        </div>
</div>

One of my clumsy attempts:

$html = file_get_contents("https://url-here.goetohere");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'product-grid-item';
$classname = 'product-element-top2';
$classname = 'product-element-top2';
$classname = 'wd-entities-title';
$classname = 'price';
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $node) {
    echo 'here »» ' . htmlentities($node->nodeValue) . '<br>';
}

Answer:

Assuming the HTML is fetched correctly before any DOM processing is attempted, it is fairly straightforward to construct some basic XPath expressions that find the indicated content.

As per the comment "page contains several of them", the sample HTML used here contains two product-grid-item divs, as you'll note in the output.

$html='
    <div  data-loop="1">
        <div >
            <a href="https://...linkhere" >
                <img width="300" height="300" src="https://image-goes-here.jpg" >
            </a>
            <div >
                <h3 >
                    <a href="https://...linkhere">The title goes here</a>
                </h3>
                <span >
                    <span >
                        <bdi>
                            <span >€</span>20,00
                        </bdi>
                    </span>
                </span>
                <div >
                    <a href="https://...linkhere" data-quantity="1" >
                        <span>Options</span>
                    </a>
                </div> 
            </div>

            <div >
                <div >
                    <a href="https://...linkhere" data-added-text="Compare Products">Buy</a>
                </div>
                <div >
                    <a href="https://...linkhere" >quick view</a>
                </div>
                <div >
                    <a  href="https://linkhere/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
                </div>
            </div>
            <div >
                <div >
                    <a href="#" rel="nofollow noopener">Close</a>
                </div>
                <div ></div>
            </div>
        </div>
    </div>
    
    <div  data-loop="1">
        <div >
            <a href="https://www.example.com/banana" >
                <img width="300" height="300" src="https://www.example.com/kittykat.jpg" >
            </a>
            <div >
                <h3 >
                    <a href="https://www.example.com/womble">Oh look, another title!</a>
                </h3>
                <span >
                    <span >
                        <bdi>
                            <span >€</span>540,00
                        </bdi>
                    </span>
                </span>
                <div >
                    <a href="https://www.example.com/gorilla" data-quantity="1" >
                        <span>Options</span>
                    </a>
                </div> 
            </div>

            <div >
                <div >
                    <a href="https://www.example.com/buy" data-added-text="Compare Products">Buy</a>
                </div>
                <div >
                    <a href="https://www.example.com/view" >quick view</a>
                </div>
                <div >
                    <a  href="https://www.example.com/wishlist/" data-key="dcf36756534755" data-product-id="387654" data-added-text="See Wishlist">Wishlist</a>
                </div>
            </div>
            <div >
                <div >
                    <a href="#" rel="nofollow noopener">Close</a>
                </div>
                <div ></div>
            </div>
        </div>
    </div>';

To process the downloaded HTML:

# set the libxml parameters and create new DOMDocument/XPath objects.
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->strictErrorChecking=false;
$dom->recover=true;
$dom->loadHTML( $html );
libxml_clear_errors();

$xp=new DOMXPath( $dom );

# some basic XPath expressions
$exprs=(object)array(
    'product-link'      =>  '//a[@]',
    'product-img-src'   =>  '//a[@]/img',
    'h3-title-text'     =>  '//h3[@]',
    'price'             =>  '//span[@]/span/bdi'
);
# find the keys (for convenience) to be used below
$keys=array_keys( get_object_vars( $exprs ) );

# store results here
$res=array();

# loop through all patterns and issue XPath query.
foreach( $exprs as $key => $expr ){
    # add key to output and set as an array.
    $res[ $key ]=[];
    $col=$xp->query( $expr );
    
    # find the data if the query succeeds
    if( $col && $col->length > 0 ){
        foreach( $col as $node ){
            switch( $key ){
                case $keys[0]:$res[$key][]=$node->getAttribute('href');break;
                case $keys[1]:$res[$key][]=$node->getAttribute('src');break;
                case $keys[2]:$res[$key][]=trim($node->textContent);break;
                case $keys[3]:$res[$key][]=trim($node->textContent);break;
            }
        }
    }
}
# show the result or do really interesting things with the data
printf('<pre>%s</pre>',print_r($res,true));

Which yields:

Array
(
    [product-link] => Array
        (
            [0] => https://...linkhere
            [1] => https://www.example.com/banana
        )

    [product-img-src] => Array
        (
            [0] => https://image-goes-here.jpg
            [1] => https://www.example.com/kittykat.jpg
        )

    [h3-title-text] => Array
        (
            [0] => The title goes here
            [1] => Oh look, another title!
        )

    [price] => Array
        (
            [0] => â¬20,00
            [1] => â¬540,00
        )

)
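
The `â¬` in the price entries is most likely an encoding artifact rather than a problem with the XPath expressions: when the markup carries no charset declaration, `DOMDocument::loadHTML()` typically does not read it as UTF-8, so the three bytes of `€` are decoded as separate characters. One common workaround, sketched here on a hypothetical one-line fragment, is to prepend a charset `<meta>` tag before parsing. The variable names are illustrative only, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical stand-in for the fetched page: a fragment with a UTF-8 euro sign
// and no charset declaration of its own.
$html = '<span><span><bdi><span>€</span>20,00</bdi></span></span>';

libxml_use_internal_errors(true);

// Parsed as-is: libxml has to guess the encoding of the input.
$plain = new DOMDocument();
$plain->loadHTML($html);

// Parsed with a charset hint prepended, so the input is treated as UTF-8.
$hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$hinted = new DOMDocument();
$hinted->loadHTML($hint . $html);

libxml_clear_errors();

$garbled = trim($plain->getElementsByTagName('bdi')->item(0)->textContent);
$clean   = trim($hinted->getElementsByTagName('bdi')->item(0)->textContent);

// Print both results for comparison.
echo $garbled, PHP_EOL;
echo $clean, PHP_EOL;
```

If the fetched page already declares its charset in its own `<head>`, the extra hint should not be necessary.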
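
The result above keeps each field in its own list. If one record per product is more convenient, the same data can be collected with XPath queries evaluated relative to each product container. The following is only a sketch under assumptions: it selects the containers by their `data-loop` attribute and relies on element structure instead of class names, the sample markup is a trimmed, hypothetical version of the HTML above, and the array keys are made up for illustration.

```php
<?php
// Trimmed, hypothetical sample: two product containers without class names.
// The leading <meta> is a charset hint so the euro sign is read as UTF-8.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div data-loop="1"><div>
    <a href="https://www.example.com/apple"><img src="https://www.example.com/apple.jpg"></a>
    <div>
        <h3><a href="https://www.example.com/apple">First title</a></h3>
        <span><span><bdi><span>€</span>20,00</bdi></span></span>
    </div>
</div></div>
<div data-loop="2"><div>
    <a href="https://www.example.com/banana"><img src="https://www.example.com/kittykat.jpg"></a>
    <div>
        <h3><a href="https://www.example.com/womble">Second title</a></h3>
        <span><span><bdi><span>€</span>540,00</bdi></span></span>
    </div>
</div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$products = [];

// Each container is used as the context node; the leading "." keeps every
// expression inside that container.
foreach ($xp->query('//div[@data-loop]') as $product) {
    $products[] = [
        'link'  => $xp->evaluate('string(.//a[img]/@href)', $product),
        'image' => $xp->evaluate('string(.//a/img/@src)', $product),
        'title' => $xp->evaluate('normalize-space(.//h3)', $product),
        'price' => $xp->evaluate('normalize-space(.//bdi)', $product),
    ];
}

print_r($products);
```

Because `string()` and `normalize-space()` take the first matching node in document order, a field that is missing from a product comes back as an empty string and the records stay aligned with their containers.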