Regex, convert html string to array, issue with languages and special characters

Time:06-11

I am not an expert with regex, and I have an issue converting an HTML string into an array of HTML elements. The idea is that, given for example:

Sample String:

<p>Welcome to my awesome website for more info <a href="www.myanotherawesomewebsite.com" target="_blank">click here</a></p> 

(which can actually be any possible combination)

I want to get something like:

'<p>', 'Welcome to my awesome website for more info', '<a href="www.myanotherawesomewebsite.com" target="_blank">', 'click here', '</a>', '</p>'

This can be achieved with the following regex:

/(<[^>]+>|[a-zè A-Z0-9]+)?/g

Testing it with the match function:

 '<p>Welcome to my awesome website for more info <a href="www.myanotherawesomewebsite.com" target="_blank">click here</a></p>'.match(/(<[^>]+>|[a-zè A-Z0-9]+)?/g)

This works for English text, but it breaks as soon as the text contains French or German characters.
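To make the failure concrete, here is a minimal sketch, assuming the pattern is meant to carry `+` quantifiers after each character class. The input string and the name `narrowPattern` are made up for illustration. Because `é` is not in the character class, the text node is cut at that character, and the trailing `?` lets the pattern produce empty matches at the positions it cannot consume.

```javascript
// Character class only lists a-z, è, space, A-Z and 0-9 (assumed form of the pattern).
const narrowPattern = /(<[^>]+>|[a-zè A-Z0-9]+)?/g;

// "\u00e9" is the precomposed "é", which is NOT in the character class.
const frenchSample = '<p>caf\u00e9</p>';

const brokenResult = frenchSample.match(narrowPattern);

// The word is truncated before "é", and empty strings appear
// where the optional group matched nothing.
console.log(brokenResult); // [ '<p>', 'caf', '', '</p>', '' ]
```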

The workaround was to do something like:

/(<[^>]+>|[a-zàâäèéêëîïôœùûüÿçäöüÄÖÜÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇß!#.?”“«» A-Z0-9\-\u00A0]+)?/g

This works, but not 100% of the time, and it does not work at all with tags like 'sup' or 'sub'.

So my question is: is there a way to improve this? Any help or advice is very welcome. Thank you in advance for reading.

CodePudding user response:

You can simply use [^<]+ for the non-tag nodes instead of enumerating characters.
Also, I don't think you need the question mark at the end. It only makes a difference for an empty input string.

So the resulting regexp is /(<[^>]+>|[^<]+)/g
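
As a rough sketch of how this could be used, assuming the pattern carries `+` quantifiers after each character class: the helper name `splitHtml` and the second test string are invented for illustration. Note that text nodes are returned untrimmed, so a trailing space before a tag is kept, and that a `>` inside an attribute value or a comment would still confuse this approach.

```javascript
// Matches either a whole tag (<...>) or a run of anything that is not "<".
const tokenPattern = /(<[^>]+>|[^<]+)/g;

// Hypothetical helper: returns [] instead of null when nothing matches.
function splitHtml(html) {
  return html.match(tokenPattern) || [];
}

const english = '<p>Welcome to my awesome website for more info <a href="www.myanotherawesomewebsite.com" target="_blank">click here</a></p>';
const accented = '<p>Ça coûte 5 m<sup>2</sup> à Zürich</p>';

console.log(splitHtml(english));
// [ '<p>', 'Welcome to my awesome website for more info ',
//   '<a href="www.myanotherawesomewebsite.com" target="_blank">',
//   'click here', '</a>', '</p>' ]

console.log(splitHtml(accented));
// [ '<p>', 'Ça coûte 5 m', '<sup>', '2', '</sup>', ' à Zürich', '</p>' ]
```

If the surrounding whitespace is unwanted, the text entries can be trimmed afterwards, for example with `splitHtml(english).map(s => s.trim())`.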
