NodeJS, Cheerio. How to find text without knowing selectors?

Time:05-16

I'm trying to find a specific piece of text on a page. I don't know the selectors, elements, parents, or anything else about the target's HTML. I just want to find out whether the site has a robots.txt, which I check by searching for 'User-agent:'.

Does anyone know how to search the parsed response for a specific text without knowing anything else about the page?

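    const axios = require('axios');
    const cheerio = require('cheerio');

    const getApiTest = async () => {
        axios.get('http://webilizerr.com/robots.txt')
            .then(res => {
                const $ = cheerio.load(res.data);
                // My attempt: `this` does not refer to any element here,
                // so this comparison does not do what I want.
                console.log($(this).text().trim() === 'User-agent:');
            })
            .catch(err => console.error(err));
    };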

Thanks for your time.

CodePudding user response:

You can simply use a regular expression to check whether "User-agent" appears in the response body. A robots.txt file is plain text, so no selectors are needed.
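As a minimal sketch of that check in isolation, the pattern can be wrapped in a small pure function. The name `looksLikeRobotsTxt` is hypothetical, and anchoring the match to the start of a line is an assumption added here so that a "User-agent:" mentioned in the middle of an HTML error page is less likely to count as a hit.

```javascript
// Hypothetical helper (not part of axios or cheerio): returns true when the
// text contains a line that starts with a "User-agent:" directive.
function looksLikeRobotsTxt(body) {
  // Anything that is not a string cannot be a robots.txt body.
  if (typeof body !== "string") {
    return false;
  }
  // "i": ignore case, so "user-agent:" also matches.
  // "m": let ^ match at the start of every line, not only the first one.
  return /^\s*User-agent\s*:/im.test(body);
}

console.log(looksLikeRobotsTxt("User-agent: *\nDisallow: /")); // true
console.log(looksLikeRobotsTxt("<html><body>Page not found</body></html>")); // false
```

With axios, the function would be called on `res.data`, assuming the server returns the file as text.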

Be aware: if the site has no robots.txt, the server should normally respond with a 404 status code. By default, axios rejects the promise for any status outside the 2xx range, so this case ends up in your catch block and you should handle it there.

Here is a working example:

const axios = require("axios");
const cheerio = require("cheerio");

const getApiTest = async () => {
  try {
    const res = await axios.get("https://www.finger.digital/robots.txt");
    const $ = cheerio.load(res.data);
    // A regex literal with test() is enough here; "i" also matches "user-agent".
    const hasUserAgent = /User-agent/i.test($.text());
    if (!hasUserAgent) {
      console.log("Doesn't have robots.txt");
      return;
    }
    console.log("Has robots.txt");
  } catch (error) {
    console.error(error);
    console.log("Doesn't have robots.txt");
  }
};

getApiTest();
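The example above reports "Doesn't have robots.txt" for every error, including timeouts and server failures. One possible refinement, sketched here under assumptions, is to tell a 404 apart from other failures. The names `classifyRobotsError` and `checkRobots` and the result labels are hypothetical; the only axios behavior relied on is that a rejected request carries `error.response` when the server answered with a non-2xx status. The HTTP client is passed in as a parameter so the sketch can run without a network.

```javascript
// Hypothetical helper: map an error thrown by axios.get() to a short label.
function classifyRobotsError(error) {
  if (error && error.response) {
    // The server answered, but with a status outside the 2xx range.
    return error.response.status === 404 ? "missing" : "http-error";
  }
  // No response at all: DNS failure, timeout, refused connection, and so on.
  return "network-error";
}

// Hypothetical wrapper: `client` is anything with an axios-like get(url)
// method that resolves to an object with a `data` property.
async function checkRobots(client, url) {
  try {
    const res = await client.get(url);
    // robots.txt is plain text, so the body can be tested directly.
    return /User-agent\s*:/i.test(String(res.data)) ? "present" : "no-directives";
  } catch (error) {
    return classifyRobotsError(error);
  }
}
```

In a real script this would be called as `checkRobots(axios, "https://www.finger.digital/robots.txt")`, and the caller can then decide whether "network-error" should be retried instead of being treated as a missing file.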
