Let's say I scrape a simple website with Puppeteer, like this one:
<h1>BIG HEADER</h1>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<br>
My goal is to save each piece of retrieved data as its own document in a MongoDB collection, so that each document holds a header (h2) and its corresponding content (p).
I'm able to scrape the whole page for h2 tags or for p tags, but the results are not connected to each other.
const pageTitle = await page.title() // blogpost-title
const headers = await page.$$eval('h2', header => {
  return header.map(h => h.textContent)
})
const pageContent = await page.$$eval('p', p => {
  return p.map(el => el.textContent)
})
The result should be something like this (collection = blogpost title):
[
  {
    title: "blogpost-title",
    header: "a header",
    content: "some content"
  },
  {
    title: "blogpost-title",
    header: "a header",
    content: "some content"
  }
]
Can anyone help me?
Answer:
I'm not totally sure what you're trying to achieve, but here's code that gives you each <h2> and the <p> that immediately follows it.
const puppeteer = require("puppeteer"); // ^18.0.4
const html = `<!DOCTYPE html>
<html>
<body>
<h1>BIG HEADER</h1>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<br>
</body>
</html>`;
let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  // Pair every .header element with the text of its next sibling element.
  const result = await page.$$eval(".header", els =>
    els.map(e => ({
      header: e.textContent,
      content: e.nextElementSibling.textContent,
    }))
  );
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
Output:
[
  { header: 'a Header', content: 'some content' },
  { header: 'a Header', content: 'some content' },
  { header: 'a Header', content: 'some content' }
]
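One caveat with the "next sibling" approach: if a header happens to be the last child of its parent, nextElementSibling is null and the callback throws. Below is a minimal sketch of a guarded variant. The name pairWithNext and the stand-in objects are hypothetical, introduced only for illustration; the callback is written as a standalone function because Puppeteer serializes it before running it in the page, so it must not reference outer variables.

```javascript
// Callback for page.$$eval: pairs each header with the text of the element
// right after it. Optional chaining guards against a header that is the last
// child of its parent, where nextElementSibling is null.
const pairWithNext = els =>
  els.map(e => ({
    header: e.textContent.trim(),
    content: e.nextElementSibling?.textContent.trim() ?? "",
  }));

// In the scraper (assumes `page` is the Puppeteer page from the script above):
//   const result = await page.$$eval(".header", pairWithNext);

// Stand-in objects that mimic the two DOM properties used, so the callback
// can be checked in plain Node without launching a browser.
const normalHeader = {
  textContent: "a Header",
  nextElementSibling: {textContent: "some content"},
};
const lastHeader = {textContent: "a Header", nextElementSibling: null};

const pairs = pairWithNext([normalHeader, lastHeader]);
console.log(pairs);
```

A header without a following element ends up with an empty content string instead of aborting the whole scrape.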
If your content is supposed to contain all tags until the next header, try:
const result = await page.$$eval(".header", els =>
  els.map(e => {
    const content = [];
    const {textContent: header} = e;
    // Walk the following siblings until the next <h2> (or the end).
    while ((e = e.nextElementSibling) && e.tagName !== "H2") {
      if (e.textContent) {
        content.push(e.textContent);
      }
    }
    return {header, content};
  })
);
Output:
[
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  {
    header: 'a Header',
    content: [
      'some content',
      'any header',
      'some more content',
      'any header',
      'some more content',
      'any header',
      'some more content'
    ]
  }
]
If your content is supposed to contain only <p> contents until the next header, try:
const result = await page.$$eval(".header", els =>
  els.map(e => {
    const content = [];
    const {textContent: header} = e;
    // Same walk as before, but keep only <p> elements.
    while ((e = e.nextElementSibling) && e.tagName !== "H2") {
      if (e.tagName === "P") {
        content.push(e.textContent);
      }
    }
    return {header, content};
  })
);
Output:
[
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  {
    header: 'a Header',
    content: [
      'some content',
      'some more content',
      'some more content',
      'some more content'
    ]
  }
]
If both h2 and h3 should count as headers, try
const result = await page.$$eval(".header, .small-header", els => //...
and proceed using either the "first <p>" logic or the "<p>-only" logic above, or adjust the "grab everything until the next header" logic accordingly:
const result = await page.$$eval(".header, .small-header", els =>
  els.map(e => {
    const content = [];
    const {textContent: header} = e;
    // Stop at the next element carrying either header class.
    while (
      (e = e.nextElementSibling) &&
      !e.classList.contains("header") &&
      !e.classList.contains("small-header")
    ) {
      if (e.tagName === "P") {
        content.push(e.textContent);
      }
    }
    return {header, content};
  })
);
Output:
[
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'any header', content: [ 'some more content' ] },
  { header: 'any header', content: [ 'some more content' ] },
  { header: 'any header', content: [ 'some more content' ] }
]
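The h2-only and h2-plus-h3 variants differ only in which classes end a section, so the stop condition can be passed in as data. page.$$eval forwards any extra arguments after the callback to it, which the sketch below relies on. The names collectUntilHeader and fakeSiblings are hypothetical, and the stand-in nodes mimic only the DOM properties the callback touches so the logic can be checked in plain Node.

```javascript
// Callback for page.$$eval. `stopClasses` arrives as the extra argument
// passed after the callback, so the same function serves both cases.
const collectUntilHeader = (els, stopClasses) =>
  els.map(e => {
    const header = e.textContent;
    const content = [];
    let el = e.nextElementSibling;
    // Walk forward until the siblings run out or another header starts.
    while (el && !stopClasses.some(c => el.classList.contains(c))) {
      if (el.tagName === "P") {
        content.push(el.textContent);
      }
      el = el.nextElementSibling;
    }
    return {header, content};
  });

// In the scraper (assumes `page` is the Puppeteer page from the script above):
//   await page.$$eval(".header, .small-header", collectUntilHeader,
//     ["header", "small-header"]);

// Hypothetical helper: builds a chain of stand-in sibling nodes that mimic
// tagName, textContent, classList.contains and nextElementSibling.
const fakeSiblings = specs => {
  const list = specs.map(([tagName, cls, textContent]) => ({
    tagName,
    textContent,
    classList: {contains: c => c === cls},
    nextElementSibling: null,
  }));
  list.forEach((n, i) => (n.nextElementSibling = list[i + 1] ?? null));
  return list;
};

const chain = fakeSiblings([
  ["H2", "header", "a Header"],
  ["P", "", "some content"],
  ["H3", "small-header", "any header"],
  ["P", "", "some more content"],
  ["BR", "", ""],
]);

// Both classes end a section: each header keeps only its own paragraph.
const headerNodes = chain.filter(n => /^H[23]$/.test(n.tagName));
const sections = collectUntilHeader(headerNodes, ["header", "small-header"]);

// Only "header" ends a section: the h2 absorbs the paragraphs under the h3.
const h2Sections = collectUntilHeader([chain[0]], ["header"]);
console.log(sections, h2Sections);
```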
You can always join the content arrays into strings if you want by changing return {header, content} to return {header, content: content.join("\n")}. Or, if you're sure there's always exactly one <p> right after the header, then nextElementSibling without the loop logic seems easiest; no arrays needed.
If you want the title added to each element, use:
return {title: document.title, header, content};
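To get from the scraped sections to the document shape shown in the question, the title and the joined content can also be attached on the Node side after $$eval returns. This is only a sketch: toDocuments is a hypothetical helper name, and the sample sections are copied from the outputs above rather than scraped live.

```javascript
// Turns scraped sections into one flat document per header.
// `title` is passed in rather than read from document.title so the
// function also works outside the browser context.
const toDocuments = (title, sections) =>
  sections.map(({header, content}) => ({
    title,
    header,
    // Accept either the array form or an already-joined string.
    content: Array.isArray(content) ? content.join("\n") : content,
  }));

const docs = toDocuments("blogpost-title", [
  {header: "a Header", content: ["some content"]},
  {header: "a Header", content: ["some content", "some more content"]},
]);
console.log(docs);
```

With the official mongodb Node.js driver, an array like docs could then be stored with await collection.insertMany(docs), assuming a connected client and a collection handle, neither of which is shown here.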