Let's say I want to scrape a website with this simplified structure:
<section >
<p></p>
<figure></figure>
<p></p>
<div></div>
<p></p>
<h2></h2>
<p></p>
<p></p>
<h2></h2>
<p></p>
<h3></h3>
<p></p>
<h3></h3>
<p></p>
<p></p>
<h3></h3>
<p></p>
</section>
EDIT: There are other tags between the first p tag and the first h2 tag that I don't want to scrape.
I can scrape all the h2 and h3 sections with these lines of code:
const pageContent = await page.$$eval( headerSelector, elements => elements.map( element => {
  const masterHeader = ''
  const { textContent: header } = element
  const content = []
  const subHeader = []
  // Walk the following siblings until the next h2 starts a new section
  while ( ( element = element.nextElementSibling ) && element.tagName !== 'H2' ) {
    switch (element.tagName) {
      case 'P':
        content.push( element.textContent )
        break
      case 'H3':
        subHeader.push( element.textContent )
        break
      default:
        break
    }
  }
  return { masterHeader, header, content, subHeader }
}))
let subContent = await page.$$eval( subHeaderSelector, elements => elements.map( element => {
  const masterHeader = ''
  const { textContent: header } = element
  const content = []
  const subHeader = []
  // Walk the following siblings until the next h3 or h2 starts a new section.
  // An h2 can never be seen inside the loop (the condition stops on it first),
  // so the former `case 'H2'`, which called push() on a string, is dropped.
  while ( ( element = element.nextElementSibling ) && element.tagName !== 'H3' && element.tagName !== 'H2' ) {
    switch (element.tagName) {
      case 'P':
        content.push( element.textContent )
        break
      case 'H4':
        subHeader.push( element.textContent )
        break
      default:
        break
    }
  }
  return { masterHeader, header, content, subHeader }
}))
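The h3 scraper above always leaves masterHeader empty. If the goal is to record which h2 each h3 belongs to, one possible approach is to walk backwards through the previous siblings until an h2 is found. The following is only a sketch: `findMasterHeader` and `linkSiblings` are hypothetical names, and the plain objects stand in for DOM elements so the snippet can run outside a browser.

```javascript
// Hypothetical helper: return the text of the closest preceding h2 sibling,
// or an empty string if there is none.
function findMasterHeader(element) {
  let node = element.previousElementSibling
  while (node) {
    if (node.tagName === 'H2') return node.textContent
    node = node.previousElementSibling
  }
  return ''
}

// Test scaffolding only: wire up sibling links on plain objects so they
// behave like a flat list of DOM children.
function linkSiblings(nodes) {
  nodes.forEach((node, i) => {
    node.previousElementSibling = nodes[i - 1] || null
    node.nextElementSibling = nodes[i + 1] || null
  })
  return nodes
}

const mockNodes = linkSiblings([
  { tagName: 'P', textContent: 'intro' },
  { tagName: 'H2', textContent: 'Section A' },
  { tagName: 'P', textContent: 'text' },
  { tagName: 'H3', textContent: 'Sub A1' },
  { tagName: 'H2', textContent: 'Section B' },
  { tagName: 'H3', textContent: 'Sub B1' },
])

console.log(findMasterHeader(mockNodes[3]))
console.log(findMasterHeader(mockNodes[5]))
```

Inside page.$$eval the helper would have to be defined within the callback itself, because functions declared on the Node side are not visible in the page context.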
Once I have extracted the data and done some data manipulation, I concatenate the results into one variable, "content":
let content = pageContent.concat(subContent)
Now I have noticed that one part is missing: the beginning of the page. There are three paragraphs there that the logic above does not scrape (by the way, the logic is based on the answers here: Save extracted data in objects).
However, I came up with the idea of targeting the section and then iterating over its children. This basically works, but I don't know how to set an end point. In this case I only want the first paragraphs, because they are not preceded by any h2 or h3 tag. From the point where the page delivers an h2 tag, the logic above does its job.
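const headlessSelector = 'section.entry-content'
// keyword has to be passed in explicitly: variables from the Node side
// are not visible inside the page.evaluate callback
const headlessContent = await page.evaluate((selector, keyword) => {
  const masterHeader = ''
  const header = `About ${keyword}`
  const content = []
  const subHeader = []
  // Collects every direct p child of the section; there is no end condition yet
  for (const element of document.querySelector(selector).children) {
    switch (element.tagName) {
      case 'P':
        content.push( element.textContent )
        break
      default:
        break
    }
  }
  return { masterHeader, header, content, subHeader }
}, headlessSelector, keyword)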
console.log(headlessContent)
Maybe I am overthinking this, but can anyone help, please?
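A minimal sketch of one way to set that end point, assuming the leading paragraphs are direct children of the section and that the first h2 marks where the other scrapers take over: stop iterating as soon as an h2 is reached. `collectIntroParagraphs` is a hypothetical name, and the mock array of plain objects only imitates `section.children` so the snippet can run without a browser.

```javascript
// Hypothetical helper: collect the text of the p children that appear
// before the first h2, skipping any other tags (figure, div, ...).
function collectIntroParagraphs(children) {
  const content = []
  for (const element of children) {
    if (element.tagName === 'H2') break // end point: headed sections start here
    if (element.tagName === 'P') content.push(element.textContent)
  }
  return content
}

// Mock data mirroring the start of the simplified page structure
const mockChildren = [
  { tagName: 'P', textContent: 'intro 1' },
  { tagName: 'FIGURE', textContent: '' },
  { tagName: 'P', textContent: 'intro 2' },
  { tagName: 'DIV', textContent: 'ignored' },
  { tagName: 'P', textContent: 'intro 3' },
  { tagName: 'H2', textContent: 'First header' },
  { tagName: 'P', textContent: 'belongs to the first header' },
]

console.log(collectIntroParagraphs(mockChildren))
```

In the Puppeteer version the same loop body would go inside the page.evaluate callback, iterating over `document.querySelector(selector).children`, since a function defined on the Node side is not available in the page context.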