Code:
text2 = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
Output:
['https://m.facebook.com/people/Vick-Arcadia/100009629167118/', 'https://m.facebook.com<span', 'https://m.facebook.com<span',
CodePudding user response:
In general, regexes aren't powerful enough to handle HTML, which is tree-structured and has matching opening and closing tags.
The preferred technique is to use a parser designed for HTML. In the Python world, lxml and BeautifulSoup are popular choices.
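As a rough illustration of the parser approach, here is a minimal sketch that collects `href` values from `<a>` tags. It uses the standard library's `html.parser` as a stand-in so it runs without extra installs; it is not BeautifulSoup or lxml, which offer a more convenient API for the same idea. The names `LinkCollector`, `extract_links`, and `sample_html` are made up for this example, and the sample markup is only a guess at what the page might look like.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample input, not the asker's real page
sample_html = (
    '<div><a href="https://m.facebook.com/people/Vick-Arcadia/100009629167118/">'
    'Profile</a><span>https://m.facebook.com</span></div>'
)
print(extract_links(sample_html))
```

Because the parser only looks at tag attributes, a neighbouring `<span` can never leak into a result the way it does in the regex output.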
CodePudding user response:
This regex should work better
'https?:\/\/[\w\.]+(\/[\/\w-]+)?'
For regexes, I recommend testing on https://regex101.com/
For working with HTML, though, it is better to use the BeautifulSoup library. If you add more details, I can help you with this.
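A small sketch of how the simpler pattern from this answer could be tried in Python. One thing to watch: when a pattern contains a capturing group, `re.findall` returns the group contents rather than the whole match, so this version switches to a non-capturing group `(?:...)` and uses `finditer`. The `\/` escapes are dropped because `/` needs no escaping in Python. `URL_PATTERN`, `find_urls`, and `sample` are names and data invented for this example.

```python
import re

# Same idea as the pattern above: scheme, host, then an optional path.
# The path group is non-capturing so the full match is what we get back.
URL_PATTERN = re.compile(r"https?://[\w.]+(?:/[/\w-]+)?")


def find_urls(text):
    """Return every substring of text that matches URL_PATTERN."""
    return [m.group(0) for m in URL_PATTERN.finditer(text)]


# Assumed sample input resembling the output in the question
sample = (
    '<a href="https://m.facebook.com/people/Vick-Arcadia/100009629167118/">x</a> '
    'https://m.facebook.com<span'
)
print(find_urls(sample))
```

Here the match stops at `<` because that character is not in either character class, whereas the range `$-_` in the original pattern spans ASCII 0x24 to 0x5F and therefore includes `<`.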