Save data in corresponding related documents, rather in separated arrays | puppeteer & node.js

Time:10-01

Let's say I scrape a simple website with Puppeteer, like this one:

<h1>BIG HEADER</h1>
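<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>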
<p>some more content</p>
<br>

My goal is to save each piece of retrieved data as its own document in a MongoDB collection, so that each document holds a header (h2) and its corresponding content (p).

I'm able to scrape the whole page for h2 or p tags, but the results end up in two separate arrays with nothing linking a header to its content.

const pageTitle = await page.title() // blogpost-title
const headers = await page.$$eval( 'h2', header => {
        return header.map( h => h.textContent )
    })
const pageContent = await page.$$eval( 'p', p => {
        return p.map( el => el.textContent )
    })

The result should look something like this (collection = blogpost title):

[
{
title: "blogpost-title",
header: "a header",
content: "some content"
},
{
title: "blogpost-title",
header: "a header",
content: "some content"
}
]

Can anyone help me?

Answer:

I'm not totally sure what you're trying to achieve, but here's code that gives you the <h2>s and the <p> that immediately follows.

const puppeteer = require("puppeteer"); // ^18.0.4

const html = `<!DOCTYPE html>
<html>
<body>
<h1>BIG HEADER</h1>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<br>
<h2 class="header">a Header</h2>
<p>some content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<h3 class="small-header">any header</h3>
<p>some more content</p>
<br>
</body>
</html>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  const result = await page.$$eval(".header", els =>
    els.map(e => ({
      header: e.textContent,
      content: e.nextElementSibling.textContent,
    }))
  );
  console.log(result);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close())
;

Output:

[
  { header: 'a Header', content: 'some content' },
  { header: 'a Header', content: 'some content' },
  { header: 'a Header', content: 'some content' }
]

If your content is supposed to contain all tags until the next header, try:

const result = await page.$$eval(".header", els =>
  els.map(e => {
    const content = [];
    const {textContent: header} = e;

    while ((e = e.nextElementSibling) && e.tagName !== "H2") {
      if (e.textContent) {
        content.push(e.textContent);
      }
    }

    return {header, content};
  })
);

Output:

[
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  {
    header: 'a Header',
    content: [
      'some content',
      'any header',
      'some more content',
      'any header',
      'some more content',
      'any header',
      'some more content'
    ]
  }
]

If your content is supposed to contain only <p> contents until the next header, try:

const result = await page.$$eval(".header", els =>
  els.map(e => {
    const content = [];
    const {textContent: header} = e;

    while ((e = e.nextElementSibling) && e.tagName !== "H2") {
      if (e.tagName === "P") {
        content.push(e.textContent);
      }
    }

    return {header, content};
  })
);

Output:

[
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  {
    header: 'a Header',
    content: [
      'some content',
      'some more content',
      'some more content',
      'some more content'
    ]
  }
]
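
The sibling walk above can also be written as a single pass over a flat, ordered list of nodes. The following is only a sketch under that assumption: groupByHeader and the plain-object nodes are hypothetical stand-ins for DOM elements, used so the grouping logic can be run and checked without a browser.

```javascript
// Groups a flat, ordered list of nodes into {header, content} sections.
// Each node is a hypothetical plain-object stand-in for a DOM element,
// with an uppercase `tagName` (as in the DOM) and a `textContent`.
function groupByHeader(nodes, headerTag = "H2") {
  const sections = [];
  let current = null; // stays null until the first header is seen

  for (const node of nodes) {
    if (node.tagName === headerTag) {
      // A new header starts a new section.
      current = {header: node.textContent, content: []};
      sections.push(current);
    } else if (current && node.tagName === "P") {
      // Only <p> text is collected; other tags are skipped.
      current.content.push(node.textContent);
    }
  }

  return sections;
}

const nodes = [
  {tagName: "H1", textContent: "BIG HEADER"},
  {tagName: "H2", textContent: "a Header"},
  {tagName: "P", textContent: "some content"},
  {tagName: "BR", textContent: ""},
  {tagName: "H2", textContent: "a Header"},
  {tagName: "P", textContent: "some content"},
  {tagName: "H3", textContent: "any header"},
  {tagName: "P", textContent: "some more content"},
];

console.log(groupByHeader(nodes));
```

In Puppeteer, the same loop would have to live inside the page.evaluate or page.$$eval callback, for example over [...document.body.children], because that callback is serialized and cannot reach helper functions defined on the Node side. Like the nextElementSibling version, it assumes the headers and paragraphs are siblings under one parent.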

If both h2 and h3 count as headers, try:

const result = await page.$$eval(".header, .small-header", els => //...

and proceed using either the "first <p>" logic, the "<p>-only" logic above, or adjust the "grab everything until the next header" logic accordingly:

const result = await page.$$eval(".header, .small-header", els =>
  els.map(e => {
    const content = [];
    const {textContent: header} = e;

    while (
      (e = e.nextElementSibling) &&
      !e.classList.contains("header") &&
      !e.classList.contains("small-header")
    ) {
      if (e.tagName === "P") {
        content.push(e.textContent);
      }
    }

    return {header, content};
  })
);

Output:

[
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'a Header', content: [ 'some content' ] },
  { header: 'any header', content: [ 'some more content' ] },
  { header: 'any header', content: [ 'some more content' ] },
  { header: 'any header', content: [ 'some more content' ] }
]

You can always join the content arrays into strings by changing return {header, content} to return {header, content: content.join("\n")}. If you're sure there's always exactly one <p> right after each header, the nextElementSibling version without the loop is the simplest option, and no arrays are needed.
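
If you'd rather keep the raw arrays from the scrape and only flatten them later, the join can also be done on the Node side after $$eval returns. A minimal sketch, where the result array is hand-written sample data in the shape shown above:

```javascript
// Sample data in the shape produced by the <p>-only snippet.
const result = [
  {header: "a Header", content: ["some content"]},
  {
    header: "a Header",
    content: ["some content", "some more content", "some more content"],
  },
];

// Collapse each content array into one newline-separated string.
const joined = result.map(({header, content}) => ({
  header,
  content: content.join("\n"),
}));

console.log(joined);
```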

If you want the title added to each element, use:

return {title: document.title, header, content};
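
To get from the scraped array to the documents described in the question, something along these lines might work. It is a sketch, not a tested integration: toDocuments and saveSections are hypothetical helper names, collection is assumed to be a Collection object from the official mongodb Node.js driver, and connecting to the database is left out.

```javascript
// Builds one document per section, each carrying the page title.
function toDocuments(title, sections) {
  return sections.map(({header, content}) => ({title, header, content}));
}

// `collection` is assumed to be a Collection from the `mongodb` driver;
// only its insertMany method is used here.
async function saveSections(collection, title, sections) {
  const docs = toDocuments(title, sections);
  if (docs.length === 0) {
    return 0; // nothing scraped, so skip the insert call entirely
  }
  const {insertedCount} = await collection.insertMany(docs);
  return insertedCount;
}

const docs = toDocuments("blogpost-title", [
  {header: "a Header", content: "some content"},
  {header: "a Header", content: "some content"},
]);
console.log(docs);
```

Building the documents in a separate step keeps the scraping code free of database concerns and lets the title come from page.title() on the Node side instead of document.title in the browser.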