Web scraping works for one HTML page, but how do I repeat it for many links?

Time:07-11

I wrote the following code to parse part of the HTML for a single URL, that is, the page at const url = 'https://www.example.com/1'.

Now I want to parse the next page, 'https://www.example.com/2', and so on, so I want to use a for loop here.

What is the easiest way to iterate so that the URL changes automatically (covering pages 1, 2, 3, ...) and the code runs repeatedly to parse the other pages? How can I use a for loop here?

const PORT = 8000
const axios = require('axios')
const cheerio = require('cheerio')
const express = require('express')
const app = express()
const cors = require('cors')
app.use(cors())

const url = 'https://www.example.com/1'

app.get('/', function (req, res) {
    res.json('This is my parser')
})

app.get('/results', (req, res) => {
    axios(url)
        .then(response => {
            const html = response.data
            const $ = cheerio.load(html)
            const articles = []

            $('.fc-item__title', html).each(function () { 
                const title = $(this).text()
                const url = $(this).find('a').attr('href')
                articles.push({
                    title,
                    url
                })
            })
            res.json(articles)
        }).catch(err => console.log(err))

})


app.listen(PORT, () => console.log(`server running on PORT ${PORT}`))

CodePudding user response:

A few considerations first. If you added CORS to your app so that you could GET the data, it is useless. You add CORS when your app is going to receive requests and send data back: it lets other people's pages use your app, and it does nothing for your own attempts to use someone else's app. CORS problems also happen only in the browser; since Node runs on the server, it will never get a CORS error.

The first problem with your code is that https://www.example.com/1, even though it works in the browser, returns a 404 Not Found error to axios, because the page doesn't really exist; only https://www.example.com would work.
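
To see a failure like this instead of only logging it, the error that axios rejects with can be inspected. This is a minimal sketch, assuming the usual axios error shape (err.response is set when the server answered with a non-2xx status). The helper name describeFetchError is made up for illustration and is not part of axios:

```javascript
// Hypothetical helper: turn an axios-style error into a short message.
// Assumption: err.response exists when the server replied with a
// non-2xx status; otherwise only err.message is available.
function describeFetchError(err) {
  if (err && err.response) {
    return `server replied with status ${err.response.status}`;
  }
  const reason = err && err.message ? err.message : "unknown error";
  return `request failed: ${reason}`;
}
```

Inside a route it could be used as `.catch((err) => res.status(502).json({ error: describeFetchError(err) }))`, so the client gets an answer instead of a request that never finishes.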

Here is an example using the comic site xkcd:

const PORT = 8000;
const axios = require("axios");
const cheerio = require("cheerio");
const express = require("express");
const app = express();

const url = "https://xkcd.com/";

app.get("/", function (req, res) {
  res.json("This is my parser");
});

let pagesToScrap = 50;

app.get("/results", (req, res) => {
  const promisesArray = [];

  for (let pageNumber = 1; pageNumber <= pagesToScrap; pageNumber++) {
    let promise = new Promise((resolve, reject) => {
      axios(url + pageNumber)
        .then((response) => {
          const $ = cheerio.load(response.data);
          let result = $("#transcript").prev().html();
          resolve(result);
        })
        .catch((err) => reject(err));
    });
    promisesArray.push(promise);
  }
  Promise.all(promisesArray)
    .then((result) => res.json(result))
    .catch((err) => {
      // Promise.all rejects as soon as any single page fails
      res.status(500).json({ error: err.message });
    });
});

app.listen(PORT, () => console.log(`server running on PORT ${PORT}`));
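
If a literal for loop that visits one page at a time is preferred over firing all requests at once, the same idea can be sketched with async/await. This is only a sketch under assumptions: scrapePages and fetchPage are hypothetical names, and fetchPage stands for whatever function takes a URL and returns a promise of the scraped result (for example, the axios plus cheerio step above).

```javascript
// Sketch: visit pages 1..lastPage one at a time with a plain for loop.
// fetchPage is a hypothetical (url) => Promise<result> supplied by the caller.
async function scrapePages(baseUrl, lastPage, fetchPage) {
  const results = [];
  for (let pageNumber = 1; pageNumber <= lastPage; pageNumber++) {
    const pageUrl = baseUrl + pageNumber;
    try {
      // await pauses the loop until this page has been fetched and parsed
      results.push({ page: pageNumber, data: await fetchPage(pageUrl) });
    } catch (err) {
      // Record the failure and keep going instead of aborting the whole run
      results.push({ page: pageNumber, error: err.message });
    }
  }
  return results;
}
```

In the server above it might be called as `scrapePages(url, pagesToScrap, (u) => axios(u).then((r) => cheerio.load(r.data)("#transcript").prev().html()))`. Sequential requests are slower than Promise.all, but they put less load on the target site and one bad page does not discard the rest.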