Puppeteer how to target a dynamic range of children elements

Time:10-07

Let's say I want to scrape a website with this simplified structure:

<section >
    <p></p>
    <figure></figure>
    <p></p>
    <div></div>
    <p></p>
    <h2></h2>
    <p></p>
    <p></p>
    <h2></h2>
    <p></p>
    <h3></h3>
    <p></p>
    <h3></h3>
    <p></p>
    <p></p>
    <h3></h3>
    <p></p>
</section>

EDIT: There are other tags, which I don't want to scrape, between the first p tag and the first h2 tag.

I can scrape all the h2s and h3s with these lines of code:

const pageContent = await page.$$eval( headerSelector, elements => elements.map( element => {
    const masterHeader = ''
    const { textContent: header } = element
    const content = []
    const subHeader = []

    while ( ( element = element.nextElementSibling ) && element.tagName !== 'H2' ) {
        switch (element.tagName) {
            case 'P':
                content.push( element.textContent )
                break;
            case 'H3':
                subHeader.push( element.textContent )
                break;
            default:
                break;
        }
    }
    return { masterHeader, header, content, subHeader}
}))


let subContent = await page.$$eval( subHeaderSelector, elements => elements.map( element => {
    const masterHeader = ''
    const {textContent: header } = element
    const content = []
    const subHeader = []

    while ( ( element = element.nextElementSibling ) && element.tagName !== 'H3' && element.tagName !== 'H2' ) {
        switch (element.tagName) {
            case 'P':
                content.push( element.textContent )
                break
            // An H2 never reaches this switch: the loop condition already stops at it,
            // and masterHeader is a string, so it has no push() anyway.
            case 'H4':
                subHeader.push( element.textContent )
                break
            default:
                break
        }
    }
    return { masterHeader, header, content, subHeader}
}))
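
The sub header loop only walks forward and stops at the next H2, so it can never see the H2 that an H3 belongs to; that header sits before the H3. Below is a minimal sketch of walking backwards instead, assuming masterHeader is meant to be the nearest preceding H2. The names findMasterHeader, linkSiblings and mockSiblings are hypothetical, and the mock objects only imitate the few Element properties used here.

```javascript
// Hypothetical helper: walk backwards from a sub header to the nearest preceding H2.
function findMasterHeader(element) {
    let current = element
    while ((current = current.previousElementSibling)) {
        if (current.tagName === 'H2') return current.textContent
    }
    return '' // no H2 before this element
}

// Stand-in nodes (not real DOM elements), linked together like siblings.
function linkSiblings(nodes) {
    nodes.forEach((node, i) => {
        node.previousElementSibling = nodes[i - 1] || null
        node.nextElementSibling = nodes[i + 1] || null
    })
    return nodes
}

const mockSiblings = linkSiblings([
    { tagName: 'H2', textContent: 'Chapter' },
    { tagName: 'P', textContent: 'text' },
    { tagName: 'H3', textContent: 'Sub chapter' },
])

const masterHeaderOfH3 = findMasterHeader(mockSiblings[2])
console.log(masterHeaderOfH3)
```

Inside page.$$eval the loop would have to be written inline in the callback, because functions defined in the Node scope are not available in the page.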

Once I have extracted the data and done some data manipulation, I concatenate the results into one variable, "content":

let content = pageContent.concat(subContent)

Now I have noticed that one part is missing: the first part of the webpage. There are three paragraphs that I do not scrape with the logic above (by the way, the logic is based on the answers here: Save extracted data in objects).

However, I came up with the idea of targeting the section and then iterating over its children. This basically works, but I don't know how to set an end. In this case, I only want the first paragraphs, the ones that come before any h2 or h3 tag. From the point where the page delivers an h2 tag, the logic above will do its job.

const headlessSelector = 'section.entry-content'
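const headlessContent = await page.evaluate((selector, keyword) => {
    const masterHeader = ''
    // keyword is passed in as an argument: variables from the Node scope are not visible inside the page
    const header = `About ${keyword}`
    const content = []
    const subHeader = []

    // const keeps the loop variable from leaking into the page's global scope
    for (const element of document.querySelector(selector).children) {
        switch (element.tagName) {
            case 'P':
                content.push( element.textContent )
                break
            default:
                break
        }
    }
    return { masterHeader, header, content, subHeader}
}, headlessSelector, keyword)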

console.log(headlessContent)
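
One possible way to set that end is to break out of the loop as soon as the first heading shows up. This is a minimal sketch, not a tested solution for the real page: collectLeadingParagraphs and mockChildren are hypothetical names, and the mock objects stand in for the children of the section so the idea can be run without a browser.

```javascript
// Hypothetical helper: collect the text of <p> children that come before the first <h2>/<h3>.
function collectLeadingParagraphs(children) {
    const content = []
    for (const element of children) {
        // Stop at the first heading; from here on the h2/h3 logic takes over.
        if (element.tagName === 'H2' || element.tagName === 'H3') break
        // Other tags (figure, div, ...) are skipped, only paragraphs are kept.
        if (element.tagName === 'P') content.push(element.textContent)
    }
    return content
}

// Stand-in objects mimicking the relevant Element properties (not real DOM nodes).
const mockChildren = [
    { tagName: 'P', textContent: 'intro 1' },
    { tagName: 'FIGURE', textContent: '' },
    { tagName: 'P', textContent: 'intro 2' },
    { tagName: 'DIV', textContent: 'skip me' },
    { tagName: 'P', textContent: 'intro 3' },
    { tagName: 'H2', textContent: 'First header' },
    { tagName: 'P', textContent: 'belongs to the header' },
]

const leadingParagraphs = collectLeadingParagraphs(mockChildren)
console.log(leadingParagraphs)
```

In the page.evaluate callback the same break check could go at the top of the existing for loop over document.querySelector(selector).children; the helper itself cannot be called from there unless it is defined inside the callback.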

Maybe I am overthinking this, but can anyone help, please?
