I have a piece of text similar to the following; it is basically a string of HTML:
hello
<span dir="auto" >Professional Referee</span>
<div>....</div>
<span dir="auto" >Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" >Professional Referee</span>
</div>
<span dir="auto" >Professional Referee</span>
<span dir="auto" >Professional Referee</span>
What I would like is to capture the innerText of every span tag (in the example above, that would be Professional Referee) and store the results in an array.
I am thinking regex would be the way to go. The regex I have so far is:
^/(<span)([\a-zA-Z0-9\s]*)(<\/span>)/$
I am not flash on regex, and an additional issue is that each span tag may have attributes that differ from those of the other tags.
I think if I can get the full span tags into an array, I can manage to remove the leftover parts myself.
I got a regex101 link here: https://regex101.com/r/9K90pa/1
Can someone point me in the right direction?
CodePudding user response:
Regex is not the ideal tool for analysing HTML. The DOM API offers a DOM Parser:
const html = `hello
<span dir="auto" >Professional Referee</span>
<div>....</div>
<span dir="auto" >Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" >Professional Referee</span>
</div>
<span dir="auto" >Professional Referee</span>
<span dir="auto" >Professional Referee</span>`;
const doc = new DOMParser().parseFromString(html, "text/html");
const spanTexts = Array.from(doc.querySelectorAll("span"), span => span.textContent);
console.log(spanTexts);
CodePudding user response:
This is kind of a bad solution: I got the regex working, but I am not flash in JS.
const regexp = /<span.*?>(.*?)<\/span>/g;
const html = `hello
<span dir="auto" >Professional Referee</span>
<div>....</div>
<span dir="auto" >Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" >Professional Referee</span>
</div>
<span dir="auto" >Professional Referee</span>
<span dir="auto" >Professional Referee</span>`;
const array = [...html.matchAll(regexp)];
console.log(array);
This outputs a 2D array, where the second item of each inner array is the innerText:
> Array [Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"]]
This approach runs into problems when the span's closing tag is on a different line from its opening tag, because . does not match line breaks by default. DOMParser is much better.
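If a regex has to be used anyway (for example where no DOMParser is available), the multi-line problem can be reduced by matching the text with [\s\S]*? instead of .*?, and the 2D array can be flattened by keeping only capture group 1. This is only a rough sketch: the sample html string and the names spanRegex and spanTexts are made up for illustration, and it will still break on nested span tags or on attribute values that contain a > character.

```javascript
// Sketch only: a regex cannot reliably handle nested <span> elements.
const html = `hello
<span dir="auto" >Professional Referee</span>
<div>
<span dir="auto" class="x">Professional
Referee</span>
</div>`;

// \b stops "<span" from also matching tag names that merely start with "span";
// [^>]* skips over any attributes;
// [\s\S]*? lazily matches the text, including line breaks.
const spanRegex = /<span\b[^>]*>([\s\S]*?)<\/span>/gi;

// Keep only capture group 1 (the text between the tags) from each match.
const spanTexts = Array.from(html.matchAll(spanRegex), match => match[1].trim());

console.log(spanTexts);
```

The second argument to Array.from maps each match to its first capture group, so spanTexts is a flat array of strings rather than an array of match arrays.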