Find all span tags in a string using JavaScript

I have a piece of text similar to this and it is basically a string of HTML code.

hello
<span dir="auto" >Professional Referee</span>
<div>....</div>
<span dir="auto" >Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" >Professional Referee</span>
</div>
<span dir="auto" >Professional Referee</span>
<span dir="auto" >Professional Referee</span>

What I would like is to capture the innerText of every span tag (in the example above, that would be Professional Referee) and store the results in an array.

I am thinking regex would be the way to go. The one I have looks like this:

^/(<span)([\a-zA-Z0-9\s]*)(<\/span>)/$

I am not flash on regex, and an additional issue is that each span tag may have attributes that differ from those of the other tags.

I think if I can get the full span tags into an array, then I can manage to strip out the leftover markup.

I got a regex101 link here: https://regex101.com/r/9K90pa/1

Can someone point me in the right direction?

CodePudding user response：

Regex is not the ideal tool for analysing HTML. The DOM API offers DOMParser, which parses the string into a document you can query with querySelectorAll:

const html = `hello
<span dir="auto" >Professional Referee</span>
<div>....</div>
<span dir="auto" >Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" >Professional Referee</span>
</div>
<span dir="auto" >Professional Referee</span>
<span dir="auto" >Professional Referee</span>`;

const doc = new DOMParser().parseFromString(html, "text/html");
const spanTexts = Array.from(doc.querySelectorAll("span"), span => span.textContent);

console.log(spanTexts);
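
Since the spans may carry different attributes, the same approach can be wrapped in a small helper that takes a CSS selector. This is only a sketch: `extractSpanTexts` and its `selector` parameter are hypothetical names, not part of the DOM API, and the DOMParser branch only runs where DOMParser exists (for example, in a browser).

```javascript
// Hypothetical helper: collects the trimmed text of every element matching `selector`.
// `root` is anything that exposes querySelectorAll, e.g. a Document from DOMParser.
function extractSpanTexts(root, selector = "span") {
  return Array.from(root.querySelectorAll(selector), el => el.textContent.trim());
}

// DOMParser is a browser API, so guard the call for other environments.
if (typeof DOMParser !== "undefined") {
  const doc = new DOMParser().parseFromString(
    `<span dir="auto" >Professional Referee</span><span>Other</span>`,
    "text/html"
  );
  // An attribute selector narrows the result to spans that have dir="auto".
  console.log(extractSpanTexts(doc, 'span[dir="auto"]'));
}
```

Passing the selector in keeps the attribute filtering in CSS rather than in string handling, so differing attributes on each span need no extra cleanup step.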

CodePudding user response：

This is kind of a bad solution. I got the regex working, but I am not flash in JS.

// Group 1 captures the text between the tags; matchAll needs the g flag on a RegExp.
const regexp = /<span.*?>(.*?)<\/span>/g;

const html = `hello
<span dir="auto" >Professional Referee</span>
<div>....</div>
<span dir="auto" >Professional Referee</span>
<div>....</div>
<div>
<span dir="auto" >Professional Referee</span>
</div>
<span dir="auto" >Professional Referee</span>
<span dir="auto" >Professional Referee</span>`;
const array = [...html.matchAll(regexp)];

console.log(array);

This outputs a 2D array in which the second item of each inner array is the innerText:

> Array [Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"], Array ["<span dir="auto" >Professional Referee</span>", "Professional Referee"]]

This would run into more problems when the span's closing tag is on another line, because . does not match newlines by default. DOMParser is much better.
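
If regex has to be used anyway (for example, outside a browser where DOMParser is not available), the line-break problem can be reduced with a pattern that also matches newlines. A minimal sketch, where `spanPattern` and `spanTextsByRegex` are made-up names for illustration; it still breaks on nested spans and other HTML edge cases, so it is not a replacement for a real parser.

```javascript
// \b stops tags like <spanner> from matching, [^>]* keeps the attribute match
// inside the opening tag, and [\s\S]*? lazily matches any character, newlines included.
const spanPattern = /<span\b[^>]*>([\s\S]*?)<\/span>/gi;

function spanTextsByRegex(html) {
  // match[1] is the first capture group: the text between the tags.
  return Array.from(html.matchAll(spanPattern), match => match[1].trim());
}

const sample = `hello
<span dir="auto" >Professional Referee</span>
<div><span class="x">Assistant
Referee</span></div>`;

// Returns a flat array of strings instead of the 2D array of match objects.
console.log(spanTextsByRegex(sample));
```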