How to get href attribute in puppeteer Node.js

Time:11-19

I want to extract information from a table using Puppeteer and Node.js, but I need help getting the link out of a table cell. The table has no class names or IDs. This is the closest I've gotten:

url: e.getElementsByTagName("td")[3].innerHTML

This gives me the following:

{
    cellText: 'AFC',
    url: '<a href="/wiki/Asian_Football_Confederation" title="Asian Football Confederation">AFC</a>'
  },
  { cellText: '', url: '' }

How can I get the following instead?

{
    cellText: 'AFC',
    url: "/wiki/Asian_Football_Confederation"
  },

Here is the code, using an arbitrary website:

const puppeteer = require("puppeteer");

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://en.m.wikipedia.org/wiki/2022_FIFA_World_Cup_Group_A")

    const myArray = await page.$$eval("table[class*='sortable']", (elements) =>
        elements.map((e) => ({
            cellText: e.getElementsByTagName("td")[3].innerText,
            url: e.getElementsByTagName("td")[3].innerHTML
        }))
    );

    console.log(myArray);

    await browser.close();
}

run();

Answer:

Assuming you want to select a single element, I'd avoid getElementsByTagName here in favor of a single selector: table[class*="sortable"] td:nth-child(4). This matches every td that is the fourth child of its row inside the table you're targeting; page.$eval then operates on the first match, which is the fourth cell of the first row of data cells.

To get the href, run a second query inside the selected cell element: element.querySelector("a").

Putting it together:

const puppeteer = require("puppeteer"); // ^19.0.0

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://en.m.wikipedia.org/wiki/2022_FIFA_World_Cup_Group_A";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const sel = 'table[class*="sortable"] td:nth-child(4)';
  const result = await page.$eval(sel, e => ({
    cellText: e.textContent,
    url: e.querySelector("a").href
  }));
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
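
One detail worth noting: the .href property returns the fully resolved, absolute URL, while the question asks for the value as written in the markup. Inside the $eval callback, e.querySelector("a").getAttribute("href") returns the raw attribute instead. The sketch below uses plain Node (no browser) to show the difference between the two values. The toRelative helper is hypothetical and only here for illustration:

```javascript
// Base URL of the page from the example above.
const pageUrl = "https://en.m.wikipedia.org/wiki/2022_FIFA_World_Cup_Group_A";

// The attribute value as it appears in the markup (taken from the question).
const rawHref = "/wiki/Asian_Football_Confederation";

// A browser resolves the attribute against the page URL; the result is
// what the `.href` property gives you.
const resolved = new URL(rawHref, pageUrl).href;

// Hypothetical helper: drop the origin to get a site-relative link back.
function toRelative(absoluteHref) {
  const { pathname, search, hash } = new URL(absoluteHref);
  return pathname + search + hash;
}

console.log(resolved);             // https://en.m.wikipedia.org/wiki/Asian_Football_Confederation
console.log(toRelative(resolved)); // /wiki/Asian_Football_Confederation
```

If you only need the relative path, getAttribute("href") in the page is the simpler route; the helper is useful when you already have absolute URLs in hand.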

Considering the text content and link are one and the same in this case, you can simplify this further to a single selector for only the anchor tag within the target <td>:

// ...
const sel = 'table[class*="sortable"] td:nth-child(4) a';
const result = await page.$eval(sel, e => ({
  cellText: e.textContent,
  url: e.href
}));
// ...

Since the data is statically present in the HTML, you don't need Puppeteer for this. It's probably better to use a lightweight HTTP request and HTML parser like Cheerio.
