Replace anything between < and > in a df column with a space

I have a df column that contains a lot of HTML tags. I want to clean them out.

df['text_col'].tolist()


['n</li></ul>',
 '<p> bla bla bla </p>',
 'bla bla </b>, </li>, <li>, </ul>',
 'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

I see two ways of cleaning it.

1. Create a list of all the tags I find in the text and replace each of them with the empty string '' (maintaining that list could be laborious).
2. Use some logic that removes anything that comes between < and >.

I don't know of any way other than str.replace, but it doesn't quite do what I described above, since it only handles one literal tag at a time.

df["text_col"].str.replace("</p>"," ")

How do I remove all the tags and clean the text_col?

CodePudding user response：

After a little bit of looking around this is what I found:

import re

x=['n</li></ul>',
 '<p> bla bla bla </p>',
 'bla bla </b>, </li>, <li>, </ul>',
 'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

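for item in x:
    # Remove every <...> tag (non-greedy match), plus stray commas and colons
    item = re.sub(r"<.*?>|,|:", "", item)
    # Collapse runs of whitespace into single spaces and trim both ends
    item = ' '.join(item.split())
    print(item)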

Outputs:

n
bla bla bla
bla bla
bla bla bla
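
A note on why the pattern uses `.*?` rather than `.*`: the `?` makes the match non-greedy, so each tag is matched on its own. The short sketch below compares the two on a sample string modelled on the question's data (the variable names `lazy` and `greedy` are only for illustration):

```python
import re

s = "bla bla <strong>bla </strong>: <h3> </h3>"

# Non-greedy: each <...> stops at the first closing '>'
lazy = re.sub(r"<.*?>", "", s)

# Greedy: one match runs from the first '<' to the last '>' in the string,
# so the text between the tags is deleted as well
greedy = re.sub(r"<.*>", "", s)

print(repr(lazy))
print(repr(greedy))
```

With the greedy version, the third "bla" and the colon disappear along with the tags, which is usually not what you want.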

I edited my answer again to refine it a little more. This should answer your question. Thank regex :)
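
Since the question is about a df column rather than a plain list, here is a minimal sketch of the same idea wrapped in a helper. The name `strip_tags` is made up for this example, and it assumes every cell is a string. It replaces each tag with a space, as the question title asks, so words on either side of a tag do not get glued together:

```python
import re

# Anything from '<' up to the next '>', or a stray comma or colon
TAG_OR_PUNCT = re.compile(r"<[^>]*>|[,:]")

def strip_tags(text):
    """Replace tags and stray , : with spaces, then normalise whitespace."""
    text = TAG_OR_PUNCT.sub(" ", text)
    return " ".join(text.split())

x = ['n</li></ul>',
     '<p> bla bla bla </p>',
     'bla bla </b>, </li>, <li>, </ul>',
     'bla bla <strong>bla </strong>: <h3> </h3>, <ul>,<b> </p>']

cleaned = [strip_tags(item) for item in x]
print(cleaned)
```

On the DataFrame this could be applied with `df['text_col'] = df['text_col'].map(strip_tags)`. If you only want the tags gone and would rather skip the helper, `df['text_col'].str.replace(r'<[^>]*>', ' ', regex=True)` should do it in one call; note that recent pandas versions need `regex=True` for the pattern to be treated as a regular expression.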