JavaScript problem with finding HTML elements in the source code of another website

Time:03-17

I am having trouble finding individual HTML elements in the downloaded source code of a selected page. $(data).find('p').length returns 2, which is correct, but $(data).find('img').length returns 0 when it should return 1.

async function getErrors() {
    await $.ajax({
            url: 'http://example.com',
            method: 'get'
        })
        .done(async (siteText) => {
            var data = $.parseHTML(siteText);
            console.log(data);
            console.log($(data).find('p').length);
            console.log($(data).find('img').length);
            await axios.get('http://anothersite.com')
                .then((response) => {
                    //do something...
                });
        });
}

Live example:

var siteText = `<!DOCTYPE html>
<html lang="pl">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Test Site</title>
    <style>
        .black{
            background-color: black;
            color: #333131;
        }
    </style>
</head>
<body>
    <h1>Strona Testowa</h1>
    <div>
        <h2>Lorem Ipsum</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Convallis aenean et tortor at risus. Pellentesque habitant morbi tristique senectus. Nisi est sit amet facilisis. Vel elit scelerisque mauris pellentesque pulvinar. Quisque egestas diam in arcu. Elit at imperdiet dui accumsan sit amet nulla. Urna porttitor rhoncus dolor purus non enim praesent elementum. Velit dignissim sodales ut eu sem integer vitae justo eget. Lacus suspendisse faucibus interdum posuere lorem. Et ultrices neque ornare aenean euismod. Porttitor eget dolor morbi non. Sit amet consectetur adipiscing elit. Amet nisl suscipit adipiscing bibendum est. Eu non diam phasellus vestibulum. Neque convallis a cras semper auctor. Risus at ultrices mi tempus imperdiet nulla malesuada pellentesque elit. Et molestie ac feugiat sed lectus vestibulum. Adipiscing diam donec adipiscing tristique risus nec. Imperdiet proin fermentum leo vel. Nibh mauris cursus mattis molestie a iaculis at erat pellentesque. Elementum integer enim neque volutpat ac tincidunt vitae semper. Nam libero justo laoreet sit. Nibh tortor id aliquet lectus proin nibh nisl condimentum id. Et sollicitudin ac orci phasellus egestas tellus. Nunc sed augue lacus viverra vitae congue eu. Dui vivamus arcu felis bibendum ut. Mattis nunc sed blandit libero volutpat sed. Commodo sed egestas egestas fringilla phasellus faucibus scelerisque eleifend. Velit aliquet sagittis id consectetur purus ut faucibus pulvinar elementum. Quam vulputate dignissim suspendisse in est ante in nibh. Accumsan sit amet nulla facilisi morbi. Ac ut consequat semper viverra. Viverra tellus in hac habitasse platea dictumst. Donec ultrices tincidunt arcu non sodales neque. In est ante in nibh mauris. Mattis enim ut tellus elementum sagittis. Consectetur adipiscing elit pellentesque habitant morbi tristique senectus et netus. Sed id semper risus in. 
Vestibulum lectus mauris ultrices eros in cursus turpis massa. Vitae tempus quam pellentesque nec nam aliquam sem et tortor. In arcu cursus euismod quis viverra nibh cras. Sit amet consectetur adipiscing elit duis tristique. Augue ut lectus arcu bibendum at varius vel pharetra vel. Pharetra magna ac placerat vestibulum lectus mauris ultrices eros in. Libero nunc consequat interdum varius sit amet mattis vulputate. Netus et malesuada fames ac. In pellentesque massa placerat duis ultricies lacus sed turpis tincidunt. Tellus in hac habitasse platea dictumst vestibulum rhoncus est pellentesque. Duis convallis convallis tellus id interdum velit laoreet. Et tortor consequat id porta nibh venenatis cras. Laoreet sit amet cursus sit amet dictum sit amet justo.</p>
    </div>
    <img src="https://png.pngtree.com/png-clipart/20190108/ourmid/pngtree-tree-green-plant-photography-png-png-image_305004.jpg" >
    <iframe width="560" height="315" src="https://www.youtube.com/embed/gK8s4LUJ7NE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    <div >
        <p >Lorem Ipsum</p>
    </div>
</body>
</html>`;

var data = $.parseHTML(siteText);
console.log(data);
console.log($(data).find('p').length);
console.log($(data).find('img').length);
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>

CodePudding user response:

I tried your code with another site and it works fine. I modified your JS to temporarily remove async/await:

$.ajax({
    url: 'http://jsfiddle.net/2AaFk/1',
    method: 'get'
})
.done((siteText) => {
    console.log(siteText);
    var data = $.parseHTML(siteText);
    //console.log(data);
    console.log($(data).find('h3').length);
});

PS: If this does not work for you, please leave a comment rather than downvoting.

CodePudding user response:

I found a solution to this problem. To find an element such as img, which ends up at the top level of the parsed result, use filter() instead of find(). I hope this is useful to someone else as well.
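If you do not know in advance whether the element is at the top level or nested, the two lookups can be combined. The following is only a sketch: it assumes jQuery is already loaded on the page as the global $, and findAll is a hypothetical helper name, not part of jQuery. It relies on addBack(), which adds the previous set of nodes (optionally filtered by a selector) to the current one.

```javascript
// Assumes jQuery is already loaded on the page as the global `$`.
// `findAll` is a hypothetical helper name, not a jQuery function.
function findAll(html, selector) {
  // parseHTML() returns a flat array of top-level nodes
  const $nodes = $($.parseHTML(html));
  // find() collects matching descendants; addBack(selector) then adds the
  // top-level nodes from the original set that match the same selector
  return $nodes.find(selector).addBack(selector);
}
```

With the sample page from the question, findAll(siteText, 'img').length should then count the top-level image, and findAll(siteText, 'p').length should still count the two nested paragraphs.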

CodePudding user response:

As an alternative, you could parse the HTML by calling html() on a newly created element. find() then works because it searches the descendants of that new element.

Detailed explanation:

Both parseHTML() and html() drop the <html>, <head> and <body> tags when they parse the markup.

parseHTML() therefore returns a flat array of the nodes that were inside <head> and <body>. When that array is wrapped directly in a jQuery object, find() searches only the descendants of each node in the array, never the nodes themselves. That is why find() cannot find elements that were direct children of <body>, such as the <img>. filter() works because it tests the nodes in the array themselves.

Once the parsed nodes are placed inside a new element, find() works on the full <body> content, since those nodes are now children of the new element.
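To make the difference concrete, here is a toy model in plain JavaScript. It is not jQuery: plain objects stand in for DOM nodes, the names node, filterByTag and findByTag are made up for this illustration, and the node list is a simplified version of the sample page (text nodes, <meta> and <style> are left out).

```javascript
// Toy model only: plain objects stand in for DOM nodes, and the two
// functions imitate which nodes jQuery's filter() and find() look at.
const node = (tag, children = []) => ({ tag, children });

// Simplified stand-in for what parseHTML() returns for the sample page
// once <html>, <head> and <body> are dropped: a flat array of nodes.
const topLevel = [
  node('title'),
  node('h1'),
  node('div', [node('h2'), node('p')]),
  node('img'),
  node('iframe'),
  node('div', [node('p')]),
];

// Like filter(): tests only the nodes in the set itself.
function filterByTag(nodes, tag) {
  return nodes.filter((n) => n.tag === tag);
}

// Like find(): tests only the descendants of each node in the set.
function findByTag(nodes, tag) {
  const found = [];
  const visit = (n) => {
    for (const child of n.children) {
      if (child.tag === tag) found.push(child);
      visit(child);
    }
  };
  nodes.forEach(visit);
  return found;
}

console.log(findByTag(topLevel, 'p').length);     // 2 (both are nested)
console.log(findByTag(topLevel, 'img').length);   // 0 (img is top-level)
console.log(filterByTag(topLevel, 'img').length); // 1

// Wrapping everything in one container makes every node a descendant.
const wrapper = node('div', topLevel);
console.log(findByTag([wrapper], 'img').length);  // 1
console.log(findByTag([wrapper], 'p').length);    // 2
```

The last two lines mirror the wrapper-element approach: once the nodes hang under a single parent, a descendant search reaches all of them.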

var siteText = `<!DOCTYPE html>
<html lang="pl">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Test Site</title>
    <style>
        .black{
            background-color: black;
            color: #333131;
        }
    </style>
</head>
<body>
    <h1>Strona Testowa</h1>
    <div>
        <h2>Lorem Ipsum</h2>
        <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Convallis aenean et tortor at risus. Pellentesque habitant morbi tristique senectus. Nisi est sit amet facilisis. Vel elit scelerisque mauris pellentesque pulvinar. Quisque egestas diam in arcu. Elit at imperdiet dui accumsan sit amet nulla. Urna porttitor rhoncus dolor purus non enim praesent elementum. Velit dignissim sodales ut eu sem integer vitae justo eget. Lacus suspendisse faucibus interdum posuere lorem. Et ultrices neque ornare aenean euismod. Porttitor eget dolor morbi non. Sit amet consectetur adipiscing elit. Amet nisl suscipit adipiscing bibendum est. Eu non diam phasellus vestibulum. Neque convallis a cras semper auctor. Risus at ultrices mi tempus imperdiet nulla malesuada pellentesque elit. Et molestie ac feugiat sed lectus vestibulum. Adipiscing diam donec adipiscing tristique risus nec. Imperdiet proin fermentum leo vel. Nibh mauris cursus mattis molestie a iaculis at erat pellentesque. Elementum integer enim neque volutpat ac tincidunt vitae semper. Nam libero justo laoreet sit. Nibh tortor id aliquet lectus proin nibh nisl condimentum id. Et sollicitudin ac orci phasellus egestas tellus. Nunc sed augue lacus viverra vitae congue eu. Dui vivamus arcu felis bibendum ut. Mattis nunc sed blandit libero volutpat sed. Commodo sed egestas egestas fringilla phasellus faucibus scelerisque eleifend. Velit aliquet sagittis id consectetur purus ut faucibus pulvinar elementum. Quam vulputate dignissim suspendisse in est ante in nibh. Accumsan sit amet nulla facilisi morbi. Ac ut consequat semper viverra. Viverra tellus in hac habitasse platea dictumst. Donec ultrices tincidunt arcu non sodales neque. In est ante in nibh mauris. Mattis enim ut tellus elementum sagittis. Consectetur adipiscing elit pellentesque habitant morbi tristique senectus et netus. Sed id semper risus in. 
Vestibulum lectus mauris ultrices eros in cursus turpis massa. Vitae tempus quam pellentesque nec nam aliquam sem et tortor. In arcu cursus euismod quis viverra nibh cras. Sit amet consectetur adipiscing elit duis tristique. Augue ut lectus arcu bibendum at varius vel pharetra vel. Pharetra magna ac placerat vestibulum lectus mauris ultrices eros in. Libero nunc consequat interdum varius sit amet mattis vulputate. Netus et malesuada fames ac. In pellentesque massa placerat duis ultricies lacus sed turpis tincidunt. Tellus in hac habitasse platea dictumst vestibulum rhoncus est pellentesque. Duis convallis convallis tellus id interdum velit laoreet. Et tortor consequat id porta nibh venenatis cras. Laoreet sit amet cursus sit amet dictum sit amet justo.</p>
    </div>
    <img src="https://png.pngtree.com/png-clipart/20190108/ourmid/pngtree-tree-green-plant-photography-png-png-image_305004.jpg" >
    <iframe width="560" height="315" src="https://www.youtube.com/embed/gK8s4LUJ7NE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    <div >
        <p >Lorem Ipsum</p>
    </div>
</body>
</html>`;

var data = $('<div></div>').html(siteText);

console.log(data.find('p').length);
console.log(data.find('img').length);
console.log(data.html());
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div id="target"></div>
