Assuming I have the following HTML snippet:
Deep Work
<p>
<a data-href="Deep Work" href="Deep Work"
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work
<a href="blabla">Some other text</a>
Which regular expression will only match the two "Deep Work" text snippets which are located completely outside of the a-blocks? So, only the ones marked as yellow in this screenshot (not the red ones):
I tried multiple approaches, but always ended up getting a match for the last red one. Which I need to avoid. Thus I would appreciate any help from the community. Thanks!
Update: Unfortunately I simplified the HTML code above too much, using line breaks, to get it readable in StackOverflow. Here is better use case:
<p><a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">Deep Work</a> Deep Work <a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">Deep Work</a> Deep Work </p>
Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.
CodePudding user response:
^Deep Work
matches only yellow ones
let text = `<p><a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">Deep Work</a> Deep Work <a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">Deep Work</a> Deep Work </p>`;
console.log(text.match(/(?<=\s)Deep Work(?=\s)/gm));
Where
- g: matches the pattern multiple times.
- m: enables multi-line mode.
CodePudding user response:
»Again only the two "Deep Work" mentions outside any A-block should be found by the RegExp.«
Since the OP's example clearly shows, that the OP wants to match any node value (or text content) from just the first-level text-nodes, a solution based on DOMParser.parseFromString
might look similar to the next provided example code ...
const sampleMarkup =
`Deep Work 1
<p>
<a data-href="Deep Work" href="Deep Work"
target="_blank" rel="noopener">Deep Work</a>
</p>
Deep Work 2
<a href="blabla">Some other text</a>`;
console.log(
Array
// make an array from ...
.from(
(new DOMParser)
// ... a parsed document ...
.parseFromString(sampleMarkup, 'text/html')
// ... body's ...
.querySelector('body')
// ... child nodes ...
.childNodes
)
// ... and filter just the first level text nodes in order to ...
.filter(node => node.nodeType === 3)
// ... retrieve each matching text node's sanitized/trimmed text content.
.map(node => node.nodeValue.trim())
);
.as-console-wrapper { min-height: 100%!important; top: 0; }
From the above comments ...
»The last edit, which provides the one-liner markup, implicitly changes the requirements for it is not equal to the before provided formatted html code. There is a difference in matching just all first level text node values (formatted code example) and matching any text node value which is not part of an
<a/>
element (the one-liner markup).«
... but as already mentioned ...
»The task described by the OP is nothing that should be solved by regex (nor can a pure regex based approach assure 100% reliability on that matter). The OP should consider a
DOMParser
based approach.«
... due to a reliable approach, the refactoring can be achieved pretty easily ...
// `<p>
// <a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">
// Deep Work
// </a>
// Deep Work 1
// <a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">
// Deep Work
// </a>
// Deep Work 2
// </p>`
const sampleMarkup = '<p><a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">Deep Work</a>Deep Work 1<a data-href="Deep Work" href="Deep Work" target="_blank" rel="noopener">Deep Work</a>Deep Work 2</p>';
function collectTextNodes(textNodeList, node) {
const nodeType = node?.nodeType;
if (nodeType === 1) {
[...node.childNodes]
.reduce(collectTextNodes, textNodeList);
} else if (nodeType === 3) {
textNodeList.push(node);
}
return textNodeList;
}
console.log(
// ... collect any text node from within ...
collectTextNodes(
[],
(new DOMParser)
// ... a parsed document's ...
.parseFromString(sampleMarkup, 'text/html')
// ... body ...
.querySelector('body')
)
// ... and filter any text node which is not located within an `<a/>` element ...
.filter(textNode =>
textNode.parentNode.closest('a') === null
)
// ... and retrieve each matching text node's sanitized/trimmed text content.
.map(node =>
node.nodeValue.trim()
)
);
.as-console-wrapper { min-height: 100%!important; top: 0; }