I am trying to remove specific span tags from a CSV file, but my code deletes every tag. I only want to remove certain ones, for example '<span style="font-family: verdana,geneva; font-size: 10pt;">'. Some cells also contain '<b>', '<p>', or '<STRONG>' tags that bold the text, like <STRONG>name</STRONG>, and I need to keep those. I only want to strip the font-family and font-size spans described above. How would I go about doing this, if there is a solution?
Thank you
`import re
CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
a_file = open("file.csv", 'r')
lines = a_file.readlines()
a_file.close()
newfile = open("file2.csv", 'w')
for line in lines:
    line = cleanhtml(line)
    newfile.write(line)
newfile.close()`
CodePudding user response:
The issue with your current code is that the regular expression CLEANR = re.compile('<.*?>') matches any tag that starts with a < and ends with a >, regardless of the contents of the tag. To remove specific span tags while keeping the others, match those tags explicitly; the | operator lets you combine several different patterns.
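Remove only span tags with the specific font-family and font-size:
CLEANR = re.compile('<span style="font-family: verdana,geneva; font-size: 10pt;">.*?</span>')
Remove multiple span tags with different attributes:
CLEANR = re.compile('<span style="font-family: verdana,geneva; font-size: 10pt;">.*?</span>|<span style="color: red;">.*?</span>')
Match the span tags with certain attributes while keeping the ones with different attributes (the lookahead checks the style attribute without consuming it):
CLEANR = re.compile('<span (?=style="font-family: verdana,geneva; font-size: 10pt;").*?</span>')
Note that each of these patterns also matches the text between the opening and closing tags, so re.sub(CLEANR, '', raw_html) deletes that text as well. To keep it, wrap .*? in a capturing group, (.*?), and substitute the group back (for example r'\1' for a single pattern) instead of ''.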
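If the goal is to drop only the span tags and keep what is inside them, one option is to capture the inner text and substitute it back. The following is a minimal sketch under that assumption: the name FONT_SPAN and the sample string are made up for illustration, and the pattern only handles spans that are not nested inside each other, since .*? stops at the first </span> it finds.

```python
import re

# Matches only the opening tag quoted in the question, captures the inner
# text, and consumes the first closing </span> that follows it.
FONT_SPAN = re.compile(
    r'<span style="font-family: verdana,geneva; font-size: 10pt;">(.*?)</span>',
    re.DOTALL,
)

def cleanhtml(raw_html):
    # \1 puts the captured inner text back, so only the span tags disappear.
    return FONT_SPAN.sub(r"\1", raw_html)

# Illustrative input, not taken from the real file.
sample = ('<span style="font-family: verdana,geneva; font-size: 10pt;">'
          '<STRONG>name</STRONG></span> <p>kept</p>')
print(cleanhtml(sample))  # <STRONG>name</STRONG> <p>kept</p>
```

Tags such as <b>, <p>, and <STRONG>, and spans with any other style, are left untouched because the pattern never matches them.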
CodePudding user response:
If your input is always an HTML string, then you could use BeautifulSoup.
Here is an example:
from bs4 import BeautifulSoup
doc = '''<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'''
soup = BeautifulSoup(doc, "html.parser")
# .descendants is the current name for recursiveChildGenerator()
for tag in soup.descendants:
    try:
        # keep only the attributes whose value mentions neither property
        result = dict(filter(lambda elem: 'font-family' not in elem[1] and 'font-size' not in elem[1], tag.attrs.items()))
        tag.attrs = result
    except AttributeError:
        # text nodes have no attrs, so skip them
        pass
print(soup)
The output:
<span><b>xyz</b></span>
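Note that the example above leaves an empty <span> behind. If the span itself should go too, a possible variant is to call unwrap() on it, which removes a tag but keeps its children. This is only a sketch: strip_font_spans and UNWANTED are hypothetical names, and it assumes a span should be removed whenever its style mentions font-family or font-size.

```python
from bs4 import BeautifulSoup

# Inline-style properties that mark a span for removal (from the question).
UNWANTED = ("font-family", "font-size")

def strip_font_spans(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # find_all returns a list, so unwrapping while looping over it is safe.
    for span in soup.find_all("span"):
        style = span.get("style", "")
        if any(prop in style for prop in UNWANTED):
            span.unwrap()  # drop the <span> tag, keep its children in place
    return str(soup)

doc = '<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'
print(strip_font_spans(doc))  # <b>xyz</b>
```

Spans with other styles are kept as they are. One side effect to be aware of: html.parser lowercases tag names, so <STRONG> comes back as <strong>.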
You can use the attribute-filtering loop in your code like this:
from bs4 import BeautifulSoup
def cleanhtml(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # .descendants is the current name for recursiveChildGenerator()
    for tag in soup.descendants:
        try:
            # keep only the attributes whose value mentions neither property
            result = dict(filter(lambda elem: 'font-family' not in elem[1] and 'font-size' not in elem[1], tag.attrs.items()))
            tag.attrs = result
        except AttributeError:
            # text nodes have no attrs, so skip them
            pass
    return str(soup)  # return as an HTML string
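One thing to watch for when applying cleanhtml to the file: if the CSV was written with the usual quoting, every " inside a cell is doubled in the raw line (style=""font-family: ...""), so a cleaner run on raw lines may not see the tag as it is written above. A hedged sketch that parses the cells with the csv module first and cleans each cell separately; clean_csv and demo_clean are hypothetical names, and demo_clean is just a stand-in for whichever cleanhtml you choose.

```python
import csv
import io
import re

def clean_csv(src, dst, clean):
    """Copy CSV rows from src to dst, applying clean() to every cell."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        writer.writerow([clean(cell) for cell in row])

def demo_clean(cell):
    # Stand-in cleaner for the demo: strips every span tag, keeps the text.
    return re.sub(r"</?span[^>]*>", "", cell)

# In-memory stand-ins for file.csv and file2.csv.
src = io.StringIO('id,text\n1,"<span style=""font-size: 10pt;"">a, b</span>"\n')
dst = io.StringIO()
clean_csv(src, dst, demo_clean)
print(dst.getvalue())
```

With real files you would pass open("file.csv", newline="") and open("file2.csv", "w", newline="") instead of the StringIO objects. The csv writer re-quotes any cell that still contains commas or quotes, so the output stays a valid CSV.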