Web Scraping NodeJs - How to recover resources when the page loads in full after several requests

Time:10-26

I'm trying to retrieve each item (composed of an image, a word, and its translation) from this page.

Link to the website: https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod[toggle][hasImage]=true

I used jsdom and got. Here is the code:


const { JSDOM } = require("jsdom");
const got = require("got");

(async () => {
    // Fetch the raw HTML exactly as the server sends it (no JavaScript is executed).
    const response = await got("https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod[toggle][hasImage]=true");

    console.log(response.body);

    // Parse the static HTML and look for the item elements.
    const dom = new JSDOM(response.body);
    console.log(dom.window.document.querySelectorAll(".ld-egdn1r"));
})();

When I display the HTML that is returned, it does not match what I see when I open the site in my browser. There are no HTML tags that contain the items.

When I look at the Network tab, I can see other resources being loaded, but I still can't find the request that retrieves the words.


I think that what I am looking for is loaded across several requests, but I don't know which one.

CodePudding user response:

The site you are trying to scrape is a single-page application (SPA) built with Svelte. Like many websites today, it renders the individual elements dynamically as needed. Because the items are not present in the HTML that the server sends, sites like this are notoriously difficult to scrape with plain HTTP requests.

If you log the response body, you will see that the elements you are selecting do not exist in it. This is because the browser runs the page's JavaScript at run time, and that JavaScript builds the UI. A plain GET request made with got, axios, fetch, or any similar HTTP client only downloads the initial HTML; it does not execute scripts.
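A quick way to confirm this without opening a browser is to check whether the class you are selecting on appears anywhere in the raw response body. The sketch below is only an illustration: `htmlContainsClass` is a hypothetical helper, and the two HTML strings are made-up stand-ins for an empty application shell and a rendered page, not the site's actual markup.

```javascript
// Returns true when any class="..." attribute in the raw HTML string
// contains the given class name as a whole token.
function htmlContainsClass(html, className) {
  const classAttr = /class\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  for (const match of html.matchAll(classAttr)) {
    const names = (match[1] ?? match[2] ?? "").split(/\s+/);
    if (names.includes(className)) return true;
  }
  return false;
}

// Hypothetical example of what a server might send for an SPA:
// an empty mount point plus a script that fills it in later.
const serverHtml =
  '<body><div id="app"></div><script src="/bundle.js"></script></body>';

// Hypothetical example of the DOM after the browser has run the script.
const renderedHtml =
  '<body><div id="app"><div class="card ld-egdn1r">...</div></div></body>';

console.log(htmlContainsClass(serverHtml, "ld-egdn1r"));   // false
console.log(htmlContainsClass(renderedHtml, "ld-egdn1r")); // true
```

In the question's code you would pass `response.body` as the first argument. If the result is `false`, the elements are being added on the client side.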

You will need to use a headless browser such as Puppeteer to render the site dynamically and then scrape the rendered page.
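Here is a minimal sketch of that approach, assuming the `puppeteer` package is installed and that the `.ld-egdn1r` selector from the question still matches the item elements (it looks like a generated class name, so it may change between builds). The function names `scrapeRenderedItems` and `cleanItems` are hypothetical, and because the inner markup of each item is not known here, the sketch just collects each element's text content and the `src` of its first image.

```javascript
// Collapse whitespace in each item's text and drop items with no text.
function cleanItems(rawItems) {
  return rawItems
    .map((item) => ({ ...item, text: item.text.replace(/\s+/g, " ").trim() }))
    .filter((item) => item.text.length > 0);
}

// Open the page in headless Chrome, wait for the items to be rendered,
// then read them out of the live DOM.
async function scrapeRenderedItems(url, selector) {
  // Required lazily so that loading this file does not need puppeteer.
  const puppeteer = require("puppeteer");
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // "networkidle2" waits until network activity has mostly stopped.
    await page.goto(url, { waitUntil: "networkidle2" });
    // Wait until at least one element matching the selector exists.
    await page.waitForSelector(selector);
    // This callback runs inside the browser, not in Node.
    const rawItems = await page.$$eval(selector, (nodes) =>
      nodes.map((node) => {
        const img = node.querySelector("img");
        return { text: node.textContent, image: img ? img.src : null };
      })
    );
    return cleanItems(rawItems);
  } finally {
    // Always close the browser, even if something above throws.
    await browser.close();
  }
}

// Example call (commented out so nothing runs on load):
// scrapeRenderedItems(
//   "https://livingdictionaries.app/hazaragi/entries/gallery?entries_prod[toggle][hasImage]=true",
//   ".ld-egdn1r"
// ).then(console.log).catch(console.error);
```

If the gallery loads more entries as you scroll, this will probably return only the items rendered on first load. You would then need to scroll or paginate inside the page before reading the elements.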
