I am dealing with dirty text data, not valid HTML. I am doing natural language processing, and short code snippets shouldn't be removed because they can contain valuable information, while long code snippets don't.
That's why I would like to remove the text between code tags only if the content to be removed is longer than n characters.
Let's say at most n = 5 characters are allowed between two code tags. Then any code element whose content is longer than 5 characters should be removed.
My approach so far removes every code snippet, regardless of its length:
import re
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub(r"<code>.*?</code>", "", text)
print(text)
Output: This is a string another string another string another string.
The desired output:
"This is a string <code>1234</code> another string <code>123</code> another string another string."
Is there a way to check the length of the text inside each <code>...</code> pair before it is actually removed?
CodePudding user response:
In Python, BeautifulSoup is often used to manipulate HTML/XML content. With this library, you can use something like
from bs4 import BeautifulSoup
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
soup = BeautifulSoup(text, "html.parser")
for code in soup.find_all("code"):
    if len(code.encode_contents()) > 5:  # Check the inner HTML length
        code.extract()  # Remove the node found
print(str(soup))
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.
# (the spaces on both sides of the removed node are kept, hence the double space)
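Note that here the length of the inner HTML is taken into account, not the inner text. Also, encode_contents() returns bytes, so the length is counted in UTF-8 bytes rather than characters; the two differ as soon as the snippet contains non-ASCII characters.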
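To make the difference between inner HTML length and inner text length concrete, here is a minimal standard-library sketch. The helper name inner_text_length is hypothetical, and stripping tags with a regex is only a rough approximation that assumes simple, well-formed inner markup; it is not a replacement for a real parser.

```python
import html
import re

def inner_text_length(inner_html):
    """Approximate the visible-text length: strip tags, then decode entities."""
    without_tags = re.sub(r"<[^>]*>", "", inner_html)
    return len(html.unescape(without_tags))

# Hypothetical content of a <code> element that itself contains markup.
inner = "<b>a&amp;b</b>"

markup_chars = len(inner)               # length of the inner HTML
text_chars = inner_text_length(inner)   # length of the text "a&b"
print(markup_chars, text_chars)
```

With a limit of 5, this snippet would be removed when the inner HTML is measured but kept when only the text is measured, so it is worth deciding which of the two the limit should apply to.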
With regex, you can use a negated character class, [^<], to match any character other than <, and apply a limiting quantifier to it. If everything longer than 5 characters should be removed, use the {6,} quantifier:
import re
text = "This is a string <code>1234</code> another string <code>123</code> another string <code>123456789</code> another string."
text = re.sub(r'<code>[^<]{6,}</code>', '', text)
print(text)
# => This is a string <code>1234</code> another string <code>123</code> another string  another string.
# (the spaces on both sides of the removed tag are kept, hence the double space)
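The quantifier approach hard-codes the limit in the pattern, and [^<] cannot match a snippet that itself contains a < character. As a possible alternative, re.sub also accepts a function as the replacement, which makes it possible to measure each match before deciding whether to drop it. The sketch below is one way to do that; the names CODE_RE, drop_long_code and max_len are made up for illustration, and it assumes code tags are not nested.

```python
import re

# DOTALL lets "." span line breaks inside a snippet.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def drop_long_code(text, max_len=5):
    """Remove <code>...</code> spans whose content is longer than max_len characters."""
    def replace(match):
        inner = match.group(1)
        # Keep short snippets unchanged, drop long ones together with their tags.
        return match.group(0) if len(inner) <= max_len else ""
    return CODE_RE.sub(replace, text)

text = ("This is a string <code>1234</code> another string <code>123</code> "
        "another string <code>123456789</code> another string.")
result = drop_long_code(text)
print(result)
```

Because the limit is an ordinary argument, the same function can be called with a different threshold, for example drop_long_code(text, max_len=3), without rebuilding the pattern.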