Scrape multiple domains with axios, cheerio and handlebars on node js


I am trying to build a web scraper that outputs certain data from Node.js into the JavaScript or HTML file I'm working on. It's important that data from multiple sub-pages (that I have no code access to) can be scraped and displayed in the same HTML or JS file. The problem is that I can't get the results from the axios callback into the global scope. If I could, my problem would be solved.

So far I have been using axios to fetch the data I need and cheerio to parse it. I created a const named "articles" into which I push every title I need from the website I'm scraping.

const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const hbs = require('hbs')


const url = 'https://www.google.com/'
const articles = []

axios(url)
    .then(response => {
        const html = response.data
        const $ = cheerio.load(html)

        // grab every <a> that sits next to a .sprite element and keep its title attribute
        $('.sprite', html).parent().children('a').each(function () {
            const text = $(this).attr('title')

            articles.push({
                text
            })
        })
        console.log(articles)

        const finalArray = articles.map(a => a.text)
        console.log(finalArray)
    })
    .catch(err => console.log(err))

That works well so far. If I output finalArray inside the callback, I get the array I want. But once I'm outside the axios callback, the array is empty. The only way it worked for me was to put the following code inside the axios callback, but then I won't be able to scrape multiple websites.

console.log(finalArray) // outputs an empty array

// with this function I want to get the array displayed in my home.hbs file
app.get('/', function (req, res) {
    res.render('views/home', {
        array: finalArray
    })
})

Basically, all I need is to get finalArray into the global scope so I can use it in the app.get handler to render the website with the scraped data.

CodePudding user response:

There are two cases here. Either you want to re-run your scraping code on each request, or you want to run the scraping code once when the app starts and re-use the cached result.

A new scrape on each request:

const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

const scrape = () =>
  axios
    .get("https://www.example.com")
    .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrape().then(text => res.json({text}));
  })
  .listen(3000);

Up-front, one-off request:

const scrapingResultP = axios
  .get("https://www.example.com")
  .then(({data}) => cheerio.load(data)("h1").text());

express()
  .get("/", (req, res) => {
    scrapingResultP.then(text => res.json({text}));
  })
  .listen(3000);

Result:

$ curl localhost:3000
{"text":"Example Domain"}

It's also possible to do a one-off request without chaining from a callback or promise, instead racing to populate a variable that is in scope of both the request handlers and the scraping response handler. Realistically, the scrape will usually have resolved before the first client request arrives, so it's common to see this:

let result;
axios
  .get("https://www.example.com")
  .then(({data}) => (result = cheerio.load(data)("h1").text()));

express()
  .get("/", (req, res) => {
    res.json({text: result});
  })
  .listen(3000);

Eliminating the race by chaining your Express routes and listener from the axios response handler:

axios.get("https://www.example.com").then(({data}) => {
  const text = cheerio.load(data)("h1").text();
  express()
    .get("/", (req, res) => {
      res.json({text});
    })
    .listen(3000);
});

If you have multiple requests you need to complete before you start the server, try Promise.all. Top-level await or an async IIFE can work too.
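For example, here is a minimal sketch of the Promise.all approach, assuming two placeholder URLs and that the first h1 of each page is what you want to keep; adapt the selector to your pages:

const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");

const urls = ["https://www.example.com", "https://www.example.org"];

Promise.all(
  urls.map(url =>
    axios.get(url).then(({data}) => cheerio.load(data)("h1").text())
  )
).then(headings => {
  express()
    .get("/", (req, res) => {
      // headings is an array with one entry per URL, in the same order
      res.json({headings});
    })
    .listen(3000);
});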

Error handling has been left as an exercise.
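As a starting point, one way to handle a failed scrape in the per-request variant could look like the sketch below; the 502 status is an assumption about how you want to report upstream failures:

express()
  .get("/", (req, res) => {
    scrape()
      .then(text => res.json({text}))
      .catch(err => res.status(502).json({error: err.message}));
  })
  .listen(3000);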

CodePudding user response:

The problem has been resolved. I used this code instead of a plain axios.get(url) call:

// urls is an array of the endpoints to scrape
axios.all(urls.map((endpoint) => axios.get(endpoint))).then(
  axios.spread(({data: user}, {data: repos}) => {
    // both response bodies are available here, one parameter per URL
  })
);

with "user", and "repos" I am now able to enter both URL data and can execute code regarding the URL i like to chose in that one function.
