Removing Specific Span Tags from a CSV file

I am trying to remove specific span tags from a CSV file, but my code is deleting all of them. I only need certain ones removed, for example ''. Some cells have '' or '', and/or a tag that bolds the text, like <STRONG>name</STRONG>, which I need to keep. I only want to remove the font-family and font-size spans stated above. How would I go about doing this, if there is a solution?

Thank you

`import re

# Pattern that matches any HTML tag
CLEANR = re.compile('<.*?>')


def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext


# Read all lines from the input file
with open("file.csv", 'r') as a_file:
    lines = a_file.readlines()

# Write the cleaned lines to a new file
with open("file2.csv", 'w') as newfile:
    for line in lines:
        line = cleanhtml(line)
        newfile.write(line)`

CodePudding user response:

The issue with your current code is that the regular expression CLEANR = re.compile('<.*?>') matches every tag: anything that starts with a < and ends at the next >, regardless of what is inside. To remove only specific span tags while keeping the others, write a pattern for exactly those tags. The | operator lets you combine several such patterns into one.

Remove only span tags with the specific font-family and font-size. The (.*?) group captures the text inside the span so it can be kept: call re.sub(CLEANR, r'\1', raw_html) in cleanhtml instead of substituting '', otherwise the contents are deleted along with the tags:

CLEANR = re.compile(r'<span style="font-family: verdana,geneva; font-size: 10pt;">(.*?)</span>')

Remove multiple span tags with different attributes:

CLEANR = re.compile(r'<span style="(?:font-family: verdana,geneva; font-size: 10pt;|color: red;)">(.*?)</span>')

Match span tags whose first attribute is that style, even if other attributes follow, while leaving spans with a different style alone:

CLEANR = re.compile(r'<span (?=style="font-family: verdana,geneva; font-size: 10pt;")[^>]*>(.*?)</span>')
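A minimal sketch of how the first pattern could be wired into the script from the question. It captures the span's contents and substitutes them back with r'\1', so only the wrapper disappears and tags such as <strong> stay put. Reading cells with the csv module is an assumption on my part: it undoes the doubled quotes that a CSV writer puts around style="..." attributes, which a line-by-line regex would not match. clean_csv and the sample row are made up for illustration, and the non-greedy match assumes the target spans are not nested inside each other.

```python
import csv
import io
import re

# Assumed target: the exact span from the answer above; (.*?) captures its contents.
FONT_SPAN = re.compile(
    r'<span style="font-family: verdana,geneva; font-size: 10pt;">(.*?)</span>',
    re.DOTALL,
)


def cleanhtml(raw_html):
    # Replace each matching span with its captured contents, dropping only the wrapper.
    return FONT_SPAN.sub(r"\1", raw_html)


def clean_csv(src, dst):
    # Hypothetical helper: clean every cell; csv takes care of quoting on both sides.
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([cleanhtml(cell) for cell in row])


# Demo on an in-memory file with a made-up row.
cell = ('<span style="font-family: verdana,geneva; font-size: 10pt;">'
        '<strong>name</strong></span>')
src = io.StringIO()
csv.writer(src).writerow(["1", cell, '<span style="color: red;">kept</span>'])
src.seek(0)
dst = io.StringIO()
clean_csv(src, dst)
print(dst.getvalue())
```

With real files, open both file.csv and file2.csv with newline="" and pass the file objects to clean_csv.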

CodePudding user response:

If your input is always an HTML string, you could use BeautifulSoup.

Here is an example:

from bs4 import BeautifulSoup

doc = '''<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'''
soup = BeautifulSoup(doc, "html.parser")
# find_all(True) returns every tag, so text nodes (which have no attrs) are skipped
for tag in soup.find_all(True):
    # Keep only the attributes whose value mentions neither font-family nor font-size
    tag.attrs = {
        name: value
        for name, value in tag.attrs.items()
        if 'font-family' not in value and 'font-size' not in value
    }
print(soup)

The output:

<span><b>xyz</b></span>

So you can use this in your code like this:

from bs4 import BeautifulSoup

def cleanhtml(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # find_all(True) returns every tag, so text nodes (which have no attrs) are skipped
    for tag in soup.find_all(True):
        # Keep only the attributes whose value mentions neither font-family nor font-size
        tag.attrs = {
            name: value
            for name, value in tag.attrs.items()
            if 'font-family' not in value and 'font-size' not in value
        }
    return str(soup)  # return as HTML string
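
Note that the function above strips the style attribute but leaves a bare <span> wrapper in the output. If the goal is to drop those span tags entirely while keeping what is inside them, BeautifulSoup's unwrap() may be a closer fit. A minimal sketch, assuming the spans to remove are the ones whose style mentions font-family or font-size; unwrap_font_spans and the sample string are hypothetical names and data used only for illustration:

```python
from bs4 import BeautifulSoup


def unwrap_font_spans(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    for span in soup.find_all("span"):
        style = span.get("style", "")
        # Assumption: only spans that set font-family or font-size should go.
        if "font-family" in style or "font-size" in style:
            span.unwrap()  # removes the tag itself but keeps its children in place
    return str(soup)


sample = (
    '<span style="font-family: verdana,geneva; font-size: 10pt;">'
    '<strong>name</strong></span> '
    '<span style="color: red;">keep</span>'
)
result = unwrap_font_spans(sample)
print(result)
```

Spans with other styles, and tags such as <strong>, are left untouched.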