I want to extract information from a table using Puppeteer and Node.js, but I need help getting the link from a table cell. The table has no class names or IDs.
This is the closest I've gotten:
url: e.getElementsByTagName("td")[3].innerHTML
This gives me the following:
[
  {
    cellText: 'AFC',
    url: '<a href="/wiki/Asian_Football_Confederation" title="Asian Football Confederation">AFC</a>'
  },
  { cellText: '', url: '' }
]
How can I get the following instead?
{
  cellText: 'AFC',
  url: "/wiki/Asian_Football_Confederation"
},
This is my code, using a Wikipedia page as an example:
const puppeteer = require("puppeteer");
async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://en.m.wikipedia.org/wiki/2022_FIFA_World_Cup_Group_A");
  // One result per table whose class contains "sortable"
  const myArray = await page.$$eval("table[class*='sortable']", (elements) =>
    elements.map((e) => ({
      cellText: e.getElementsByTagName("td")[3].innerText,
      url: e.getElementsByTagName("td")[3].innerHTML
    }))
  );
  console.log(myArray);
  await browser.close();
}
run();
CodePudding user response:
Assuming you want to select a single element, I'd avoid getElementsByTagName here in favor of the one-shot selector table[class*="sortable"] td:nth-child(4). This targets the table, and page.$eval then takes the first match in document order: the first td that is the fourth child of its row.
To get the href, run a second query inside the cell element selected above: element.querySelector("a").
Putting it together:
const puppeteer = require("puppeteer"); // ^19.0.0
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  const url = "https://en.m.wikipedia.org/wiki/2022_FIFA_World_Cup_Group_A";
  await page.goto(url, {waitUntil: "domcontentloaded"});
  const sel = 'table[class*="sortable"] td:nth-child(4)';
  const result = await page.$eval(sel, e => ({
    cellText: e.textContent,
    url: e.querySelector("a").href
  }));
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
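One caveat, in case it matters for your output: the href property on an anchor is the fully resolved URL, while the question asks for the raw attribute value. Here's a minimal sketch that uses Node's built-in URL class to mimic that resolution. It assumes the page has no <base> element changing the base URL, and the variable names are only illustrative.

```javascript
// Page URL from the example and the raw attribute value shown in the question.
const pageUrl = "https://en.m.wikipedia.org/wiki/2022_FIFA_World_Cup_Group_A";
const rawHref = "/wiki/Asian_Football_Confederation";

// Roughly what the anchor's .href property reports:
// the attribute value resolved against the page URL.
const resolved = new URL(rawHref, pageUrl).href;

// If you only have the absolute URL, the path can be recovered from it.
const pathOnly = new URL(resolved).pathname;

console.log(resolved); // https://en.m.wikipedia.org/wiki/Asian_Football_Confederation
console.log(pathOnly); // /wiki/Asian_Football_Confederation
```

Inside the browser callback, e.querySelector("a").getAttribute("href") should give you the raw value directly, so no conversion is needed there.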
Since the cell's text and the link's text are one and the same in this case, you can simplify this further to a single selector for only the anchor tag within the target <td>:
// ...
  const sel = 'table[class*="sortable"] td:nth-child(4) a';
  const result = await page.$eval(sel, e => ({
    cellText: e.textContent,
    url: e.href
  }));
// ...
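If you'd rather keep one entry per table, as in the output from the question, a variation with page.$$eval might look like the sketch below. The name extractLinks is my own, not part of Puppeteer. The callback is written without references to outer variables because Puppeteer serializes it and runs it in the browser, and it falls back to empty strings for tables that have no link in a fourth-column cell.

```javascript
// Receives the matched <table> elements and returns one {cellText, url} per table.
// Must be self-contained: it runs in the page, not in Node.
const extractLinks = (tables) =>
  tables.map((table) => {
    // First anchor inside a fourth-column cell, or null if the table has none.
    const a = table.querySelector("td:nth-child(4) a");
    return {
      cellText: a ? a.textContent.trim() : "",
      // getAttribute returns the raw attribute value, e.g. "/wiki/...".
      url: a ? a.getAttribute("href") : "",
    };
  });

// Usage inside the async function above (page is a Puppeteer Page):
// const results = await page.$$eval('table[class*="sortable"]', extractLinks);
```

The null check is what produces entries like { cellText: '', url: '' } instead of throwing when a table doesn't have the expected cell.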
Since the data is already present in the static HTML, you don't need Puppeteer for this. A plain HTTP request plus an HTML parser like Cheerio is probably a better, lighter-weight fit.